METHOD AND APPARATUS FOR AN AVATAR USER INTERFACE SYSTEM



FIELD OF THE INVENTION 

The present invention concerns methods and apparatus for an avatar
user interface system to people, information, media and agents with
photo-realistic avatars.

BACKGROUND TO THE INVENTION 

It is well established in the marketplace that face to face
communications have significant advantages over ways of communicating
virtually such as video conference calls and audio conference calls.
With the increasing globalisation of business and the shrinking
timescales of new commercial initiatives, it is even more important to
communicate well. But at the same time, the cost of travelling to
face to face communication sessions is increasing.

An alternative method of communicating is in a virtual world. Several
companies have provided 3D worlds with avatars including Blaxxun
(Germany) with its consumer world Cybertown. In these worlds, the
user navigates his avatar into proximity with one or more avatars and
chat then commences involving the owners of the avatars. User-driven
gestures are incorporated. The avatars used in these virtual worlds
are not photo-realistic representations of the person they represent.

Photo-realistic avatars of people can be generated in Avatar Booths as
disclosed in UK Patent GB 2336981. An ad hoc standards group called
H-Anim has drafted a standard for avatars, H-Anim 2001, that can be
found on the world wide web at www.h-anim.org. These photo-realistic
avatars are also becoming anima-realistic: they can be animated
realistically.

Harold Sun and Dimitri Metaxas published a solution to generating
life-like walking animation for an avatar automatically following a
path in the proceedings of SIGGRAPH 2001, pp. 261-269.

As well as moving anima-realistically, the avatars need to talk
anima-realistically. Eric Cosatto and Hans Peter Graf, in their paper
'Sample-based Synthesis of Photo-Realistic Talking Heads' given at the
Computer Animation conference, June 8-10 1998 in Philadelphia, show a
system with a talking head speaking from a synthesis of text. Their
paper explains the conventional approach of generating lip movements
from phonemes and how co-articulation is handled.



The present invention aims to provide avatar user interface system
means by which a user has a high sense of presence that overcomes some
of the disadvantages of other communication methods. Embodiments of
the present invention use photo-realistic avatars of the participants
in the communication session to create a virtual communication room
with high photo-realism and high anima-realism. Embodiments of the
present invention provide an avatar user interface system in which a
synchronous communication session can take place without the user
needing to control the user interface manually and thus allowing the
user to concentrate on communicating. Embodiments of the present
invention provide an avatar user interface system in which
multi-tasking can take place between multiple communication and
information processing tasks. Embodiments of the present invention
provide an avatar user interface system in which people and agents may
communicate with each other.

There is a multitude of applications of an avatar user interface
system. Some examples of significant commercial applications are
given below:

- conferences
- meetings
- e-learning tutorials
- product presentations
- exhibitions
- call switchboard
- multi-tasking communication tool
- security
- interactive games
- collaborative work
- shared space virtual reality
- social exercise
- practicing



SUMMARY OF THE INVENTION 

In accordance with one aspect of the present invention there is
provided an apparatus for an avatar user interface system comprising:

- server means for serving the communication session;
- one or more computing appliance means;
- network means for joining said server means and said computing
  appliance means;
- avatar means for representing each user visually; and
- avatar user interface application means resident on each computing
  appliance means;

operable by one or more users.

In accordance with this aspect of the present invention there is
provided a method of communication between a plurality of users via an
avatar user interface system comprising the steps of:

- joining a plurality of computing appliance means and a server means
  for serving the communications session to start a communication
  session by means of a network;
- viewing the avatars of the users involved in the communication
  session on the said plurality of computing appliance means;
- a user first communicating into a computing appliance;
- one or more users receiving the first communication on one or more
  other computing appliances;
- avatars enacting the first communication on said computing
  appliances;
- a user responding to the first communication in a second
  communication;
- one or more users receiving the second communication on one or more
  other computing appliances;
- avatars enacting the second communication on said computing
  appliances;
- continuing the exchange of communications until the session is
  finished; and
- terminating the joining of the computing appliance means and the
  server means for serving the communications session to terminate
  the communication session.
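
The order of the steps above can be pictured as a simple relay loop.
The following is a minimal sketch only: the names Session, join,
communicate and leave are hypothetical and do not appear in this
disclosure, and the sketch illustrates the sequence of steps rather
than any particular implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical stand-in for the server means serving one session."""
    # Maps each joined user to the communications enacted on that user's appliance.
    appliances: dict = field(default_factory=dict)

    def join(self, user: str) -> None:
        # Joining a computing appliance to the server starts its part in the session.
        self.appliances[user] = []

    def communicate(self, sender: str, message: str) -> None:
        # Every other appliance receives the communication, and the
        # sender's avatar enacts it there.
        for user, enacted in self.appliances.items():
            if user != sender:
                enacted.append((sender, message))

    def leave(self, user: str) -> list:
        # Terminating the join ends the session for that appliance.
        return self.appliances.pop(user)


session = Session()
for name in ("alice", "bob", "carol"):
    session.join(name)
session.communicate("alice", "first communication")
session.communicate("bob", "second communication")
```

In this sketch the exchange continues for as long as communicate is
called, and the session ends once every appliance has called leave.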


WO 03/058518 PCT/GB03/00031

In accordance with a further aspect of the present invention there is
provided a method of communicating between at least one user and at
least one avatar agent via an avatar user interface system comprising
the steps of:

- joining one or more computing appliance means, an avatar agent
  hosting server means hosting one or more intelligent agent software
  units and a server means for serving the communications session to
  start a communication session by means of a network;
- viewing the avatars of the said avatar agents and said users
  involved in the communication session on the said computing
  appliance means;
- a user or an avatar agent first communicating;
- if there are one or more users who did not first communicate, then
  the one or more users who did not first communicate receive the
  first communication on one or more other computing appliances;
- avatars enacting the first communication on said computing
  appliances;
- if there are one or more avatar agents who did not first
  communicate, then the one or more avatar agents who did not first
  communicate receive the first communication;
- a user or an avatar agent responding to the first communication in
  a second communication;
- one or more users or one or more avatar agents receiving the second
  communication;
- if there are one or more avatars receiving the second
  communication, then avatars enact the second communication on said
  computing appliances;
- continuing the exchange of communications until the session is
  finished; and
- terminating the joining of the computing appliance means, the
  avatar agent hosting server means and the server means for serving
  the communications session to terminate the communication session.

In a further aspect, the present invention aims to provide an
integrated multi-media communication system for use in a broad range
of applications based around photo-realistic avatars for communication
with people and intelligent agents in both synchronous and
asynchronous ways that is supportive of multiple concurrent
communication sessions and of switching between communication
sessions.

In a further aspect, the present invention aims to provide a user
interface system in which avatar means may be photo-realistic avatar
means or parameter avatar means or animatable image avatar means.

BRIEF DESCRIPTION OF THE DRAWINGS 

Embodiments of the present invention will now be described, by way of
example only, with reference to the accompanying drawings, in which:

Figure 1 is a block diagram of apparatus for an avatar user interface
system in accordance with a first embodiment of the present invention;

Figure 2 is a schematic diagram of an avatar;

Figure 3 is a block diagram of avatar visual types;

Figure 4 is a block diagram for the reconstruction of a parameter
avatar;

Figure 5 is an example table of avatar parameters;

Figure 6 is a block diagram of apparatus for generating and editing a
parameter avatar;

Figure 7 is a list of action impersonation parameters stored in the
memory of a personal computer;

Figure 7a is a flow diagram illustrating the process for defining
action impersonation parameters and action impersonation rules for an
activity;

Figure 8 is a block diagram of apparatus for generating and editing
action impersonation parameters;

Figure 9 is a schematic diagram of an avatar hosting server system;

Figure 10 is a schematic diagram of an avatar number;

Figure 11 is a block diagram of a personal computer with an avatar
user interface;

Figure 12 is a diagrammatic representation of avatar user interface
functionality in an avatar conference application;

Figure 13 is a block diagram of a presentation media window;

Figure 14 is a block diagram of a whiteboard media window;

Figure 15 is a representation of an example of a meeting room media
window;

Figures 16a, 16b, 16c and 16d are schematic diagrams to illustrate
possible virtual camera positions in a virtual video conference;

Figures 17a, 17b and 17c are schematics of three possible layouts in
the meeting room media window;

Figure 18 is a plan view of the virtual meeting room illustrating
possible virtual camera positions;

Figure 19 is a set of four timelines of the camera shots during an
avatar user interface session in four modes;

Figure 20 is a block diagram of a software director and avatar engine
player;

Figure 21 is a block diagram of events on a personal computer and a
session server;

Figures 22a, 22b, 22c, 22d and 22e are schematics of the five seating
plans viewed by the five participants;

Figure 23 is a schematic of the audio mixer;

Figure 24 is a schematic of the audio mixer for multiple
conversations;



Figure 25 is a block diagram of a lip sync generator;

Figure 26 is a timeline of a lip sync generator;

Figures 27a, 27b, 27c and 27d are diagrammatic representations of four
lip sync animation types that can be used to animate a talking head;

Figure 28 is a flow diagram illustrating the steps involved in a lip
sync generator;

Figure 29 is a flow diagram illustrating the steps in the passage of
sound through an avatar user interface system;

Figure 30a is a spectrogram;

Figure 30b is a graphical diagram of a spectrum;

Figure 31 is a block diagram of the session server system;

Figure 32 is a block diagram of an apparatus for holding an avatar
user interface session using voice and data networks in accordance
with a second embodiment of the present invention;

Figure 33 is a schematic diagram of an animatable image in accordance
with a third embodiment of the present invention;

Figure 34 is a schematic diagram of an animatable image avatar;

Figure 35 is a schematic diagram of a set of four state images for the
jaw and mouth segment;

Figure 36 is a tree diagram of the hierarchy of animatable avatar
image components;

Figure 37 is a schematic diagram of an animatable image generator;

Figure 38 is a schematic diagram of an apparatus for animatable image
generation;

Figure 39 is a block diagram of an avatar user interface system with
multiple formats of avatar;

Figure 40 is a schematic layout of an avatar user interface with
attendee functionality;

Figure 41 is a schematic layout of an apparatus for a multi-party
location in an avatar user interface system in accordance with a
fourth embodiment of the present invention;

Figure 41a is a schematic of the 3D sound processing;

Figure 42 is a representation of an example of the displayed avatar
user interface with switchboard functionality in accordance with a
fifth embodiment of the present invention;

Figure 43 is a block diagram of a multi-session server system;

Figure 44 is a block diagram of a stand-alone avatar user interface
system in accordance with a sixth embodiment of the present invention;

Figure 45 is a representation of an example of the avatar user
interface system with extended exhibition functionality in accordance
with a seventh embodiment of the present invention;

Figure 46 is a block diagram of an avatar agent hosting system and
intelligent agent software in accordance with an eighth embodiment of
the present invention;

Figure 47 is a block diagram of an apparatus for generating
impersonation parameters;

Figure 48 is a block diagram of the avatar user interface system with
extended security functionality in accordance with a ninth embodiment
of the present invention;

Figure 49 is a block diagram of an avatar user interface system for
interactive computer gaming in accordance with a tenth embodiment of
the present invention;

Figure 50 is a schematic of an avatar user interface system for a
six-sided cave in accordance with an eleventh embodiment of the
present invention;

Figure 51 is a schematic of an avatar user interface system for two
caves connected by a network;

Figure 52 is a schematic of an avatar user interface system comprising
two exercise stations connected together by a network in accordance
with a twelfth embodiment of the present invention;

Figure 53 is a schematic of the display of an avatar user interface
system with an avatar virtual environment as the background in
accordance with a fourteenth embodiment of the present invention;

Figure 54 is a schematic of a terminal of an avatar user interface
system including motion-tracking cameras in accordance with a
fifteenth embodiment of the present invention;

Figure 55 is a block diagram of apparatus for an avatar user interface
system with multiple user devices;

Figure 56 is a schematic of a display device consisting of a display
screen, an AVE projector and a Presentation projector in accordance
with a sixteenth embodiment of the present invention;

Figure 57 is a schematic of a display device in which the AVE and
Presentation projection means are combined into one physical unit;

Figure 58 is a schematic of a multi-density display device comprising
an area of low density pixels and an embedded area of high density
pixels;

Figure 59 is a schematic of an avatar user interface system with a
mixed audience of avatars of virtual users at various locations and
physical users;

Figure 60 is a block diagram of an apparatus for presentation
preparation.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 

FIRST EMBODIMENT

Figure 1 is a block diagram of an apparatus for an avatar user
interface system 261 in accordance with a first embodiment of the
present invention.

Avatar Conference application

The avatar user interface system 261 invention can be embodied in many
applications. In this first embodiment, the avatar user interface
system 261 is disclosed as an avatar conference application.
An avatar conference is an example of a communication session on an
avatar user interface system 261. Further embodiments disclose the
avatar user interface system 261 invention embodied in different
applications.

In this embodiment, the apparatus comprises two or more personal
computers 3 with memory 345, display devices 264 and displayed avatar
user interfaces 260 that are connected by a network 2 to a session
server 1 with memory 346 using a standard avatar interface protocol
300 and an avatar hosting server 4 containing a plurality of avatars 5
and memory 344.

As will be described in detail below, in accordance with the present
invention, avatars 5 representing the parties taking part in the avatar
user interface session are stored on the avatar hosting server 4. The
avatars 5 are transferred to the personal computers 3 across the
network 2. The session server 1 mixes the voice streams from the
personal computers 3 and returns them to the personal computers 3.
The avatars 5 are displayed in the displayed avatar user interfaces
260 of the display devices 264 of the personal computers 3.

Avatars

Figure 2 is a schematic diagram of an avatar 5. The avatar 5 has an
avatar identity 275 comprising an avatar number 8, a password 9 and a
display permission flag 259. Associated with the avatar 5 are one or
more types of data which may include: photo-realistic visual avatar
data 340, animatable image avatar segment data 395, other visual image
data 396, avatar parameters 230, impersonation parameters 325,
biometric data 317, intelligent agent software unit 320, billing data
342 and personal data 341. The impersonation parameters 325 are of two
types: voice impersonation parameters 331 and action impersonation
parameters 332. Each set of data associated with the avatar 5 may be
resident on different servers on the network 2 or servers on other
networks that may be accessible via the network 2.

Figure 3 is a block diagram of avatar visual types. The visual
component of an avatar 5 may be a 3D avatar 39 or an animatable avatar
image 382 or another avatar type 239. There are two types of 3D
avatar 39: a photo-realistic avatar 238 and a parameter avatar 232.
An avatar 5 includes at least one of the photo-realistic visual avatar
data 340 or the avatar parameters 230 or the animatable image avatar
segment data 395 or the other visual image data 396 and any other or
all of the other types of data.

An avatar 5 comprising at least photo-realistic visual avatar data 340
is referred to as a photo-realistic avatar 238. An avatar 5 comprising
at least avatar parameters 230 is referred to as a parameter avatar
232. An avatar 5 comprising at least animatable image avatar segment
data 395 is referred to as an animatable image avatar 382. An avatar
5 comprising at least either photo-realistic visual avatar data 340 or
avatar parameters 230 is referred to as a 3D avatar 39. An avatar 5
comprising at least other visual image data 396 is referred to as
another avatar type 239.
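
The naming rules above amount to a classification over which data an
avatar 5 carries. The following is an illustrative sketch only: the
class name, field names and the use of byte strings for each data type
are assumptions made for the example, not part of this disclosure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Avatar:
    """Illustrative record for an avatar 5; all field names are assumptions."""
    avatar_number: str
    password: str
    display_permission: bool
    photo_realistic_data: Optional[bytes] = None     # visual avatar data 340
    avatar_parameters: Optional[bytes] = None        # avatar parameters 230
    animatable_segment_data: Optional[bytes] = None  # segment data 395
    other_visual_data: Optional[bytes] = None        # other visual image data 396


def visual_types(avatar: Avatar) -> set:
    """Return every visual type name that the avatar qualifies for."""
    types = set()
    if avatar.photo_realistic_data is not None:
        types.add("photo-realistic avatar")
    if avatar.avatar_parameters is not None:
        types.add("parameter avatar")
    if types:
        # Either of the two kinds above also makes the avatar a 3D avatar.
        types.add("3D avatar")
    if avatar.animatable_segment_data is not None:
        types.add("animatable image avatar")
    if avatar.other_visual_data is not None:
        types.add("another avatar type")
    return types
```

Because the rules are stated with "at least", one avatar 5 can qualify
for several names at once, which is why the sketch returns a set
rather than a single label.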

Photo-realistic Avatars

Photo-realistic visual avatar data 340 is a computer model that
represents an individual taking part in the avatar conference. It is
photo-realistic. When viewed by a person who knows the individual
that it represents, that photo-realistic visual avatar data 340 will
be recognisable as a photo-realistic avatar 238 of the individual in
the same way that a photograph of an individual is recognisable by a
person who knows the individual as being a photograph of that
individual.

In this embodiment, the photo-realistic visual avatar data 340 is a
three dimensional (3D) computer model. The structure of the
photo-realistic visual avatar data 340 is similar in terms of its
components to the draft H-Anim 2001 standard. In this embodiment, the
external shape of the photo-realistic visual avatar data 340 is
represented by polygonal meshes totalling approximately 6,000
polygons. A generic avatar topology is used in which every
photo-realistic visual avatar data 340 of every person has the same
number of polygons, whether the person is tall or short, fat or thin,
male or female. Texture mapping is used to position images of the
avatar over the polygons so that the avatar can be rendered to appear
like the individual it represents. The compressed size of the
photo-realistic visual avatar data's computer model is typically
between 200 and 900 Kbytes.

In this embodiment a subset of the full number of joints specified in
H-Anim is used; in particular, not all the joints in the back, the
hands and the feet are modelled. If all the joints were used, there
would be considerable extra computational cost for very little extra
anima-realism of movement.

Parameter Avatars 

Photo-realistic visual avatar data 340 can be quite large and, on
lower bandwidth connections, it can take a long time to download. For
the avatar conference to feel right to the user, a person's avatar
should be seen when he is speaking, rather than just heard as a
disembodied voice. Ideally, the avatar should appear in the avatar
conference at the same time as a person joins the conference. If it
is known who will be in the conference when the conference is
organised, then photo-realistic visual avatar data 340 can be sent out
in advance of the start of the conference. However, if someone joins
the conference without any notice, then it is a purpose of this
invention to use parameter avatars 232 that are very small and that
will appear shortly after the person joins.

Figure 4 is a block diagram for the reconstruction of a parameter
avatar 232. A set of avatar parameters 230 is sent to a personal
computer 3, enabling a parameter avatar 232 to be constructed from a
general database of avatar information 231. Avatar parameter 230
download assumes that there is a general database of avatar
information 231 already downloaded at the personal computer 3 from
which a parameter avatar 232 can be quickly generated from a small set
of avatar parameters 230. The general database of avatar information
231 is downloaded the first time that an avatar conference is accessed
on a personal computer 3 and remains for later avatar conferences
unless it is deleted.

Figure 5 is an example table of avatar parameters 230 that can be used
to define a parameter avatar 232 from a general database of avatar
information 231. This set of avatar parameters 230 is typically in
the range of 100 to 1,000 bytes in size, but may be smaller than 100
bytes or larger than 1,000 bytes, and can therefore be sent over the
network 2 from the avatar hosting server 4 to the personal computer 3
very quickly. The parameter avatar 232 can also be assembled very
quickly from the database 231 and the avatar parameters 230. In this
way, an avatar of the new participant can be constructed quickly that
would look like that person from a distance.
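
The reconstruction and the size advantage can be sketched as follows.
This is a hedged illustration only: the contents of GENERAL_DATABASE,
the parameter names and the 56 kbit/s link speed used in the note
below are all assumptions chosen for the example, and the transfer
time ignores latency and protocol overhead.

```python
# Hypothetical general database 231: option tables already held locally.
GENERAL_DATABASE = {
    "hairstyle": {0: "short", 1: "long", 2: "bald"},
    "build": {0: "slim", 1: "medium", 2: "heavy"},
}


def reconstruct_parameter_avatar(parameters: dict) -> dict:
    """Resolve compact parameter values against the local database."""
    avatar = {}
    for name, value in parameters.items():
        table = GENERAL_DATABASE.get(name)
        # Indexed parameters are looked up; free values such as a height
        # pass through unchanged.
        avatar[name] = table[value] if table is not None else value
    return avatar


def transfer_seconds(size_bytes: int, link_bits_per_second: int) -> float:
    """Idealised transfer time for a payload of the given size."""
    return size_bytes * 8 / link_bits_per_second


avatar = reconstruct_parameter_avatar({"hairstyle": 1, "build": 0, "height_cm": 180})
```

On an assumed 56 kbit/s link, a 1,000 byte parameter set takes about
0.14 seconds to transfer, whereas photo-realistic visual avatar data
of 900 Kbytes (taken here as 900,000 bytes) takes about 129 seconds.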

This parameter avatar 232 may be displayed until such time as the
photo-realistic avatar 238 has been downloaded from the avatar hosting
server 4 to the personal computer 3 at which point the parameter
avatar 232 is automatically replaced with the photo-realistic avatar
238. The photo-realistic avatar 238 can be downloaded progressively,
such that rather than a sudden change from a parameter avatar 232 to a
photo-realistic avatar 238, the user sees a slow morphing from one to
the other over a period of time. Progressive download can be
implemented in many ways. One implementation might be to first
download the geometry, then the joint positions, then the textures. A
second implementation might download low-resolution textures followed
by high-resolution textures.
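
One way to picture the slow morphing is a linear blend between the two
sets of vertex positions. The sketch below is an assumption-laden
illustration, not the disclosed method: it assumes that the parameter
avatar and the photo-realistic avatar share the same vertex count and
ordering, and that the blend is driven by the fraction of the download
completed. The function name morph is hypothetical.

```python
def morph(parameter_vertices, photo_vertices, fraction_downloaded):
    """Blend vertex positions from the parameter avatar towards the
    photo-realistic avatar as the download progresses."""
    # Clamp the blend factor so the result never overshoots either shape.
    t = min(max(fraction_downloaded, 0.0), 1.0)
    return [
        tuple(p + t * (q - p) for p, q in zip(start, end))
        for start, end in zip(parameter_vertices, photo_vertices)
    ]
```

At a fraction of 0 the parameter avatar geometry is shown unchanged,
and at a fraction of 1 the vertices coincide with those of the
photo-realistic avatar.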

It is possible to use a large set of avatar parameters 230, and the
power of each parameter is such that an extensive database 231 can be
used to generate very life-like parameter avatars 232. The most
distinctive part of a human is the face. Faces can be generated that
are very close to the actual person's face from as few as 50 avatar
parameters.



Avatar generation

Avatars and parameter avatars may be generated in several ways:

- a photo-realistic avatar 238 may be generated from photos of the
  user
- a parameter avatar 232 may be built up manually by the user without
  using photos of the user
- a parameter avatar 232 may be automatically generated from a
  photo-realistic avatar 238 of the user

Parameter avatar generated from photo-realistic avatar 

Figure 6 is a block diagram of apparatus for generating and editing a
parameter avatar 232. The parameter avatar 232 may be generated
automatically or manually.

A set of avatar parameters 230 is automatically created from a
photo-realistic avatar 238 of the person by a parameter avatar
generator 233 with avatar editing software 234. There is enough
information in a photo-realistic avatar 238 for the parameter avatar
generator 233 to be relatively simple to create for those skilled in
the art. The parameter avatar generator 233 is shown resident on a
personal computer 3 but may be resident on an avatar hosting server 4
or any other server or computer on the network 2.

Parameter avatar generated manually by user

If a user 17 has not yet had a photo-realistic avatar 238 made of
himself, then he can quickly create a set of avatar parameters 230 for
a parameter avatar 232 that is roughly similar to him by providing
input into the parameter avatar generator 233. Parameter avatar
creation in the parameter avatar generator 233 is by selection by the
user 17 of a number of graphical alternatives such as hairstyles and
by entry by the user 17 of data such as height.

In the situation where a new user without an avatar needs to join his
first avatar conference as quickly as possible, it is imperative that
it is possible to create a 'rough' parameter avatar as quickly as
possible. In these situations, users are very impatient and the
interaction in which the parameter avatar is created must be very
efficient and fast. Typically, under time pressure, the user may be
prepared to spend 30-60 seconds on this interaction. The interaction
would normally be one of selection of options with a mouse click
rather than typing in data. Later on, the user may go back and spend
more time refining his parameter avatar. It is a purpose of this
embodiment that there are two or more ways of generating a parameter
avatar depending on the amount of time that the user has available.

Action impersonation parameters 

It is a purpose of this avatar user interface system invention that 
action impersonation parameters may be used to characterise how a 
person moves. One of the objectives of a successful avatar user 
interface system invention is anima-realism. It is a first objective 
for an avatar to move anima-realistically such that a user who does 
not know the person whose avatar it is, thinks that the animation is 
realistic. It is a second objective for an avatar to move anima- 
realistically whilst impersonating the actions of the person whose 
avatar it is, such that a user who knows the person whose avatar it 
is, thinks that the animation is both realistic and typical of that 
person. Achievement of this second objective will eliminate any 
dissatisfaction by the user of seeing an avatar of someone he knows 
behaving uncharacteristically and enable a deeper sense of copresence 
from use of the avatar user interface system invention. This avatar 
user interface system invention achieves the second objective by using 
action impersonation parameters. 

Figure 7 is a list of action impersonation parameters 332 stored in
the memory 345 of a personal computer 3. Action impersonation
parameters 332 include: walking 400, running 401, ambient motion
whilst standing 402, ambient motion whilst sitting 403, gestures
whilst talking 404, facial expressions whilst talking 405 and lip
synchronisation whilst talking 406. In the example of the action
impersonation parameter for gestures whilst talking 404, there are a
number of possible gestural animations (actions) that might be
associated with this 'gestures whilst talking' action impersonation
parameter 404. These could include: waving hands excitedly in a beat
mode whilst talking and moving hands to time with the end of a
sentence.

Action impersonation parameters 332 are not limited to the above 
characteristics, but may be extended to include any characteristics 
required in an application of this avatar user interface system 
invention. For the purposes of this disclosure, a reference to action 
impersonation parameters 332 will mean reference to either or both of: 
types of action impersonation parameter and action impersonation 
parameter values.

Values for action impersonation parameters 332 depend on the type of
action and its definition. Values are set for action impersonation
parameters 332 of a particular person in their avatar 5.
Alternatively, values may be assigned as a set of action impersonation
parameters 332 for a generic person in or with a context. Examples of
sets of generic values might include:

- an Italian person
- a hyperactive person
- a person in a meeting
- a hyperactive Italian in a meeting

A context for generic impersonation parameters might be a 
communication context. Examples of communication contexts include: 
meetings, product presentations, virtual exhibitions, receptions, 
major conferences, security situations, interactive game playing, 
exercise and practicing. 

Values may also be assigned for individual action impersonation 
parameters 332 that are characteristic of a style. An example is 
walking, where styles of walk can be defined such as a rolling gait, a 
mincing step etc. 
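
The relationship between generic sets and a particular person's values
can be sketched as a layered lookup. Everything in the example below
is an assumption made for illustration: the parameter keys, the
numeric values (treated loosely as intensities between 0 and 1), the
merge order and the rule that personal values override generic ones
are not specified in this disclosure.

```python
# Hypothetical generic sets; keys and values are illustrative only.
GENERIC_SETS = {
    "Italian person": {"gestures_whilst_talking": 0.8},
    "hyperactive person": {
        "ambient_motion_sitting": 0.9,
        "gestures_whilst_talking": 0.9,
    },
    "person in a meeting": {"walking": 0.0, "ambient_motion_sitting": 0.3},
}


def combine(*set_names: str) -> dict:
    """Merge generic sets left to right; later sets override earlier ones."""
    combined = {}
    for name in set_names:
        combined.update(GENERIC_SETS[name])
    return combined


def for_person(personal_values: dict, *generic_names: str) -> dict:
    """Personal values, where set, take precedence over generic defaults."""
    return {**combine(*generic_names), **personal_values}
```

Under these assumptions, a composite such as 'a hyperactive Italian in
a meeting' is obtained by calling combine with the three generic set
names in the chosen order of precedence.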

It is a purpose of this embodiment to disclose a manual process for an
appropriate activity 337 of (i) defining types of action impersonation
parameter 332 involving the analysis of video of one or more people
undertaking the activity 337 and (ii) deriving action impersonation
rules 333 as to the context and frequency of use of each type of
action impersonation parameter 332 in the activity 337. In the
disclosure of this manual process, the activity used by way of example
is a meeting, but this embodiment is not limited to the activity of
meetings and is applicable to most types of human activity.

Figure 7a is a flow diagram illustrating the process for defining 
action impersonation parameters 332 and action impersonation rules 333 
for an activity 337. In the first step S1000, a significant corpus of 
videos 336 of meetings is recorded. Each meeting will typically 
require several video cameras 29 to synchronously record different 
participants at a sufficient resolution. Using a plurality of cameras
29 overcomes the problem that one camera cannot image participants
sitting all the way around a table at a high enough resolution.
Meetings with different numbers of participants are recorded.
Meetings with people from different cultures may be recorded. 
Meetings with people of different personalities may be recorded. A 
video corpus 336 of 20-50 hours is a typical size for an activity 337. 

In the second step S1001, the corpus is processed by a trained person
along a timeline. The actions of each participant may be related to a
number of parameters such as status, activity type (speaking,
listening, observing), speech content and emotion. The result is an
annotated timeline 334 with actions of each participant related to the
parameters.

In the third step S1002, the annotated timeline 334 is analysed to
produce: (i) a type definition of each possible action impersonation
parameter 332, and (ii) a set of rules that can be incorporated in a
finite state machine 333.
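
By way of illustration only, the analysis in step S1002 could be
sketched as a frequency count over the annotated timeline 334. The
record fields, the action names and the use of relative frequency as a
rule weight are assumptions made for this sketch; the disclosure does
not prescribe a data format or an analysis method.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One hypothetical entry on the annotated timeline 334."""
    time_s: float        # position on the timeline, in seconds
    participant: str
    activity_type: str   # "speaking", "listening" or "observing"
    action: str          # observed action, e.g. "nod" (illustrative name)


def derive_rules(timeline):
    """Return {activity_type: {action: relative frequency of use}}."""
    counts = defaultdict(Counter)
    for entry in timeline:
        counts[entry.activity_type][entry.action] += 1
    rules = {}
    for activity_type, actions in counts.items():
        total = sum(actions.values())
        rules[activity_type] = {a: n / total for a, n in actions.items()}
    return rules


# A tiny invented timeline, standing in for a 20-50 hour corpus.
timeline = [
    Annotation(0.0, "Ted", "speaking", "hand_gesture"),
    Annotation(1.5, "Jill", "listening", "nod"),
    Annotation(3.0, "Ted", "speaking", "hand_gesture"),
    Annotation(4.0, "Ted", "speaking", "lean_forward"),
    Annotation(6.0, "Jill", "listening", "nod"),
]
rules = derive_rules(timeline)
```

A table of this kind could, under these assumptions, supply the context
(activity type) and frequency of use on which the transitions of a
finite state machine are based.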

Figure 8 is a block diagram of apparatus for generating and editing
action impersonation parameters for an avatar 5 of a particular
person. Action impersonation parameters 332 may be set manually by
providing input from the user 17 into the action impersonation
generator/editor 335. The user 17 may be the particular person whose
avatar it is or someone else such as a friend, a family member or an
expert providing a service. Action impersonation parameters 332 may
be edited manually by providing input from the user 17 into the action
impersonation generator/editor 335.

Individual action impersonation parameters may be set in the action
impersonation generator/editor 335 by manual selection by the user 17
from a number of high-level visual alternatives for each individual
action impersonation parameter, such as walking style, and by entry by
the user 17 of data such as whether a particular gesture is typically
used.

Where a new user without an avatar needs to join his first
communication context, such as an avatar conference, as quickly as
possible, it must be possible to set up a 'rough' set of generic
action impersonation parameters with minimal delay. This can be
achieved at the highest level by providing the user with a small
number of pre-set generic action impersonation parameter sets to
choose between. Examples include:

- passive
- active
- hyper-active

An alternative high-level way of setting action impersonation
parameters quickly is to choose between pre-set action impersonation
parameter sets according to culture. The user may choose between
cultural characteristics such as:

- Anglo-Saxon
- Japanese
- Hispanic
- Italian



After personal action impersonation parameters 332 have been set in a
high-level, generic way, they may be edited at a low-level where they
can really be fine-tuned to the way a person moves. For instance a
person may be hyper-active and use a characteristic gesture a lot but
never use another gesture. By editing at a low-level, the action
impersonation parameters 332 may be refined such that a user who knows
the person whose avatar it is, thinks that the animation is both
realistic and typical of that person.

It is a purpose of this embodiment to disclose a manual process for
defining a set of action impersonation parameters 332 for a particular
person using an action impersonation generator/editor 335 involving
manual input by a user 17. In the first step, the user 17 makes
selections from a number of choices at a high level. In the second
step, the user edits those selections at a lower level.
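
The two-step manual process can be sketched as follows. This is a
minimal illustration only: the parameter names, the numeric values and
the function names are hypothetical and are not taken from the
disclosure, which leaves the content of a parameter set open.

```python
import copy

# Hypothetical pre-set generic parameter sets; all values are invented.
PRESETS = {
    "passive": {"gesture_rate": 0.2, "walking_style": "slow step",
                "gestures": {"nod": True, "point": False}},
    "active": {"gesture_rate": 0.5, "walking_style": "even step",
               "gestures": {"nod": True, "point": True}},
    "hyper-active": {"gesture_rate": 0.9, "walking_style": "brisk step",
                     "gestures": {"nod": True, "point": True}},
}


def select_preset(name):
    """First step: a high-level choice. Returns an independent copy."""
    return copy.deepcopy(PRESETS[name])


def edit_low_level(params, gestures=None, **fields):
    """Second step: override individual fields without mutating the input."""
    edited = copy.deepcopy(params)
    edited.update(fields)
    edited["gestures"].update(gestures or {})
    return edited


# A hyper-active person who walks with a rolling gait and never points.
params = select_preset("hyper-active")
params = edit_low_level(params, walking_style="rolling gait",
                        gestures={"point": False})
```

Copying the pre-set before editing keeps the generic sets unchanged, so
that the same pre-set can be offered to the next new user.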

For automatic setting of action impersonation parameters, a video
camera 29 may make video recordings 336 of a person carrying out a
number of pre-defined actions. The action impersonation
generator/editor 335 may automatically set the action impersonation
parameters by automatic processing of the video recording. In this
process, the emphasis is on replicating the particular person's style
in actions that have different styles. The camera 29 may be mounted
in a booth 18.

It is a purpose of this embodiment to disclose an automatic process
for setting a set of action impersonation parameters 332 for a
particular person using an action impersonation generator/editor 335.
In the first step, video recordings 336 are made of a person carrying
out a number of defined actions. In the second step, the action
impersonation generator/editor 335 automatically analyses the video
recordings 336 to generate a set of action impersonation parameters
332.



Action impersonation parameters may be set by a number of means in
addition to those disclosed. For example, videos can be made of a
person carrying out a number of tasks and an expert may study the
video and set the action impersonation parameters.

The processes disclosed above for manually and automatically
generating, setting and editing action impersonation parameters 332
define a number of methods by example. This aspect of the invention
is not limited to the processes disclosed, but covers all processes
for manually and automatically generating, setting and editing action
impersonation parameters 332.



Avatar numbering 

Each avatar 5 has a unique avatar number 8. An avatar 5 may contain
multiple types of visual avatar data including a photo-realistic
avatar 238, a parameter avatar 232 and an animatable image avatar 382.
When an avatar 5 is first created, it is allocated a unique avatar
number 8. At any point thereafter, visual avatar data of different
types may be added, deleted or edited.

Avatar access permission 

The password 9 when used together with the avatar number 8 gives the
user 17 access to change the avatar 5 including other types of data
such as personal data 341. The display permission flag 259 if set by
a user 17 with a password 9 and avatar number 8 gives permission to
all other users 17 to use the avatar 5 for viewing purposes such as in
a displayed avatar user interface 260 without need of the password 9.
Access permissions are not limited in this invention to the password 9
and the display permission flag 259. A range of access permissions
may be created for access to different types of data by different
users.

Avatar Hosting Server 

Figure 9 is a schematic diagram of an avatar hosting server system.
The avatar hosting server 4 contains a database 6, avatar hosting
management software 229, and avatars 5. In this embodiment, each
avatar 5 has a unique avatar number 8 and a password 9. The avatar
hosting server 4 may also contain one or both of billing software 237
and avatar generation software 222.

When the avatar hosting management software 229 on the avatar hosting
server 4 receives a request 7 over the network 2 from a personal
computer 3 for an avatar 5, then the avatar hosting management
software 229 will check with the database 6 to see if the request is
accompanied by a valid avatar number 8 and password 9. If the request
7 is valid, then the avatar hosting management software 229 will send
the requisite avatar 5 to the personal computer 3 in such a form that
it can be changed. If the request 7 is not accompanied by a valid
password 9, then the avatar hosting management software 229 will check
to see if the display permission flag 259 is set for the avatar 5 with
avatar number 8. If the display permission flag 259 has been set,
then the avatar hosting management software 229 will send the
requisite avatar 5 to the personal computer 3 in such a form that the
avatar 5 can only be displayed and cannot be changed. If the request
7 is not accompanied by a valid password 9 and the display permission
flag 259 is not set for the avatar 5 with avatar number 8, then the
avatar hosting management software 229 will not send the requisite
avatar 5.
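
The decision made by the avatar hosting management software 229 on
receipt of a request 7 can be sketched as below. The class and function
names, the in-memory dictionary standing in for the database 6 and the
plain-text password comparison are assumptions made to keep the sketch
short; they are not part of the disclosure.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    EDITABLE = "editable"          # avatar sent in a form that can be changed
    DISPLAY_ONLY = "display-only"  # avatar sent for display only
    DENIED = "denied"              # avatar not sent


@dataclass
class AvatarRecord:
    avatar_number: str
    password: str              # plain text here purely for brevity
    display_permission: bool   # display permission flag 259


def handle_request(database, avatar_number, password=None):
    """Decide in which form, if any, the requested avatar is sent."""
    record = database.get(avatar_number)
    if record is None:
        return Access.DENIED
    if password is not None and password == record.password:
        return Access.EDITABLE
    if record.display_permission:
        return Access.DISPLAY_ONLY
    return Access.DENIED


# Hypothetical database contents.
database = {
    "1001": AvatarRecord("1001", "secret", display_permission=True),
    "1002": AvatarRecord("1002", "hidden", display_permission=False),
}
```

For example, a request for avatar "1001" without a password yields
display-only access, whereas the same request for avatar "1002" is
refused because its display permission flag is not set.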

Photo-realistic Avatar Generation 

Photo-realistic avatars 238 of people are generated and edited from
digital images 19 of people, usually taken from several sides of the
person with a camera 221, using generation software 222 and avatar
editing software 234 in an Avatar Generator Editor (AGE) 235.

The quickest and least technical means of generating these digital
images 19 is by the person using a special avatar generation apparatus
18 such as an avatar booth run by generation management software 236.
The generation management software 236 usually takes the images 19 of
the person using a camera 221 and generates a photo-realistic avatar
238 using AGE software 235 on a personal computer 3. The special
avatar generation apparatus 18 usually contains means for regulating
the quality of the images 19 that reduces or eliminates the need for
skilled processing of the images 19 before they enter the AGE software
235. Such regulation means usually include fixed camera settings,
controlled lighting levels and a uniform colour and shape background
and floor such as a chroma green sheet, but the apparatus need not
include these regulation means, nor is it limited to them.

Alternatively, any camera 221 can be used to take images 19 of the
person in a largely unregulated way. These images can be transferred
to a personal computer 3 on which AGE software 235 is resident.
Alternatively, the images 19 can be sent over the network 2 to the
avatar hosting server 4 on which there is also generation software 222
that automatically generates a photo-realistic avatar 238 without any
user intervention. Alternatively, the images 19 can be sent over the
network 2 to an avatar generation service 223 that uses an AGE 235.



Avatar Generator Editor (AGE) 

The automatic generation of an avatar or a parameter avatar generates
an imprecise avatar. The avatar generated may not at first be
pleasing to the user, in the same way that photographic images of a
person are often not pleasing to the person. The user may think that
the avatar does not represent himself or even his self-image.

In avatar and parameter avatar generation, an interactive editing
process is possible to change the avatar. There are two main types of
editing, both of which may be used:

- low-level: changing the avatar by manually touching up the 3D
  shape, textures, texture coordinates and joint positions
- high-level: changing the avatar by interactively adjusting avatar
  parameters from which the avatar is regenerated

It is a purpose of this embodiment to disclose an avatar generator
editor (AGE) 235 containing a photo-realistic avatar generator 222 or
a parameter avatar generator 233 and avatar editing software 234 in
which editing can take place at low level or high level or both.

Peer to peer avatar serving

In an alternative to using an avatar hosting server 4, a peer to peer
avatar serving system can be used. In a peer to peer avatar serving
system, an avatar hosting server 4 is not required and the user's
avatar 5 that is resident in local storage 274 on his personal
computer 3 can be sent to all other participants' personal computers 3
directly over the network 2.

Avatar Hosting Services 

Figure 10 is a schematic diagram of an avatar number 8. The avatar
number 8 comprises two parts: an avatar hosting service identity
number AHS-ID 224 and an avatar identity number A-ID 225. If there are
multiple avatar hosting servers 4 on the network 2, then each avatar
hosting server 4 has an avatar hosting service identity AHS-ID 224.




WO 03/058518 



PCT/GB03/00031 



There is an avatar hosting registry server AHR 226 on the network 2 
run by AHR management software 227 stored in memory 347. When a 
personal computer 3 needs an avatar 5 it takes the avatar hosting 
service identity AHS-ID 224 and sends it to the AHR management 
software 227 to request the location of the avatar hosting server 4 
corresponding to the AHS-ID 224 on which the avatar 5 is stored. 

Each avatar identity number 225 for a particular AHS-ID 224 is unique.
The personal computer 3 contacts the avatar hosting management
software 229 on the correct avatar hosting server 4 with the location
provided by the AHR management software 227 and retrieves the avatar 5
using the AHS-ID 224.
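
The two-part avatar number 8 and the registry look-up can be sketched
as follows. The textual form "AHS-ID-A-ID" with a hyphen separator, the
host names and the function names are assumptions for illustration; the
disclosure does not fix an encoding for the avatar number or a format
for the location returned by the AHR management software 227.

```python
from typing import NamedTuple


class AvatarNumber(NamedTuple):
    ahs_id: str  # avatar hosting service identity number AHS-ID 224
    a_id: str    # avatar identity number A-ID 225


def parse_avatar_number(text):
    """Split an avatar number written as '<AHS-ID>-<A-ID>' (assumed form)."""
    ahs_id, separator, a_id = text.partition("-")
    if not separator or not ahs_id or not a_id:
        raise ValueError(f"not a two-part avatar number: {text!r}")
    return AvatarNumber(ahs_id, a_id)


# Hypothetical registry contents held by the avatar hosting registry server.
REGISTRY = {
    "0001": "ahs-one.example",
    "0002": "ahs-two.example",
}


def locate_hosting_server(number, registry=REGISTRY):
    """Return the network location registered for the number's AHS-ID."""
    return registry[number.ahs_id]


number = parse_avatar_number("0002-000123")
location = locate_hosting_server(number)
```

Because each A-ID 225 need only be unique within one AHS-ID 224, the
pair of values is what identifies an avatar across the network.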

It is a purpose of this first embodiment to disclose a process for
retrieving an avatar comprising the following steps:

- a user provides an avatar number and password;
- a computing appliance sends the avatar number and password to the
  network location of an avatar hosting service;
- avatar hosting server management software on the avatar hosting
  service checks a database to verify that the avatar number and
  password are valid;
- if the avatar number and password are valid, then the avatar hosting
  server management software on the avatar hosting service sends the
  avatar to the computing appliance.

It is a further purpose of this first embodiment to disclose a process
for retrieving an avatar using an avatar hosting registry server
comprising the following steps:

- a user provides an avatar number and password;
- a computing appliance sends an avatar hosting service identity
  number to an avatar hosting registry server;
- the avatar hosting registry server sends to the computing appliance
  the network location of the avatar hosting service corresponding to
  the avatar hosting service identity number;
- the computing appliance sends the avatar number and password to the
  network location of the avatar hosting service;
- avatar hosting server management software on the avatar hosting
  service checks a database to verify that the avatar number and
  password are valid;
- if the avatar number and password are valid, then the avatar hosting
  server management software on the avatar hosting service sends the
  avatar to the computing appliance.

This invention is not limited to this one way of designing an avatar
number 8 but includes all other ways of designing an avatar number 8
such that the avatar 5 with avatar number 8 may be located on one or
more avatar servers.

Personal Computer 

Figure 11 is a block diagram of a personal computer 3 with an avatar
user interface 260 in an environmental location 273. The personal
computer 3 includes a display device 264, a webcam 29, a headset 11
comprising microphone 12 and headphones 13, a keyboard 14 and a mouse
15 in a cabinet 16 running an operating system 20 which in this
embodiment is the Microsoft Windows XP operating system, an avatar
user interface software application 262 as a plug-in to the browser
263 in which the displayed avatar user interface 260 is seen by the
user 17 in the browser window 21 on the desktop 423. The headset 11
is normally worn by the user 17 of the personal computer 3 during an
avatar conference in such a way that the user 17 can hear through the
headphones 13 and speak into the microphone 12. Each PC peripheral
may be connected to the PC cabinet 16 by a wired or a wireless method;
if it is a wireless method, the peripheral may contain a battery or be
connected to a power source.

Information flowing

During an avatar user interface session (avatar conference call),
those participating in the session will communicate via information
flowing between the personal computers 3 and the session server 1.
This information can be in different media formats including: voice,
music, video, avatar animation, 3D models, presentation images, text,
office application sharing, spreadsheets, word processor documents and
whiteboard annotation.

Session server arrangement

The session server 1 may be resident on the network 2 in a
server-client network design. Alternatively, the session server
functionality may be resident on a personal computer 3 in a peer to
peer network design. In this way, the personal computers 3 of the
users 17, with session server functionality resident on at least one
personal computer 3, are sufficient to use the avatar user interface
system 261 over the network 2 without a separate session server 1.

Display arrangement

According to this embodiment, Figure 12 is a diagrammatic
representation of avatar user interface functionality in a conference
application. The personal computer 3 is running a personal computer
operating system user interface 20 which is visible in the display
device 264 as a desktop 423. The personal computer 3 is also running a
network browser which is visible in the display device 264 as a
browser window 21 and which in this embodiment is the Microsoft
Internet Explorer browser Version 6. The personal computer 3 is
connected over the network to the session server 1 via the browser
window 21. The Uniform Resource Locator (URL) 22 active in the
browser window 21 points to the session server 1. In the browser
window 21 during a conference there is the avatar session user
interface 10 comprising a large conference window 23, two smaller
conference windows 24, 25 and one or more interaction windows 26. The
large conference window 23 has control buttons 27; these buttons
change depending on which media is being shown in the large conference
window 23. An interaction window 26 has mode buttons 28.

The user interface may be 'always on' for the user 17 to speak.
Alternatively, a button 272 is depressed by the user 17 when speaking
and is acknowledged with the button 272 changing colour to show that
the microphone is live. The button may also be activated by pushing a
key on the keyboard 14.

The large conference window 23 is used to show whichever media is in
use and requires the maximum resolution. The two smaller conference
windows 24, 25 are for two other media formats.

The interaction windows 26 have several functions including: text
chat, attendance list, address list, agenda and audio settings. The
number of interaction windows 26 can be reduced by means of a window
having several modes. In this embodiment there are two windows 26.
The first window 26 is permanently dedicated to text chat. The second
window 26 is controlled by mode buttons 28 for swapping between



The three conference windows 23, 24, 25 may have the same aspect ratio
or may have different aspect ratios depending on the system design.
The three conference windows 23, 24, 25 show the three avatar
conference media windows: the presentation, the whiteboard and the
meeting room. The user may select the media window in one of the two
small conference windows 24, 25 to go into the large conference window
23 and the media window currently in the large window swaps back into
the small window vacated by the selected media window.
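
The swap between the large and small conference windows can be
sketched as a simple exchange of two slots. The slot keys and the
function name are hypothetical labels chosen for this illustration.

```python
SMALL_SLOTS = ("small_24", "small_25")


def select_media_window(layout, small_slot):
    """Swap the media in the chosen small window with the large window.

    `layout` maps the three conference windows to the media they show;
    a new mapping is returned and the input is left unchanged.
    """
    if small_slot not in SMALL_SLOTS:
        raise ValueError(f"not a small conference window: {small_slot!r}")
    new_layout = dict(layout)
    new_layout["large_23"] = layout[small_slot]
    new_layout[small_slot] = layout["large_23"]
    return new_layout


layout = {
    "large_23": "meeting room",
    "small_24": "presentation",
    "small_25": "whiteboard",
}
layout = select_media_window(layout, "small_24")
```

After the call the presentation occupies the large window and the
meeting room has moved into the small window that the presentation
vacated; the whiteboard is unaffected.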

Presentation media window 

According to this embodiment, Figure 13 is a block diagram of a
presentation media window 30 during an avatar user interface session.
The presentation media window 30 can show images, slides, video clips
and other visual media such as Flash from Macromedia Inc (USA) or
applications such as computer games. The presentation media window 30
is controlled by the user using the control buttons 31-35 when it
is in the large window, but cannot be operated when it is in a smaller
window. There is a mode of use of the invention in which one party in
the conference can make a presentation and, when the presenter
changes a slide, the same slide will change in the presentation
windows of all the parties.

Button 31 returns the presentation to the first slide. Button 32
moves back one slide. Button 33 moves forward one slide. Button 34
goes to the last slide in the presentation. Button 35 toggles between
local control of the presentation and presenter control of the
presentation.
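
The behaviour of buttons 31 to 35 can be sketched as a small state
holder. The class name, the zero-based slide index and the clamping at
the first and last slides are assumptions of this sketch; how presenter
control overrides the local buttons is not modelled here.

```python
class PresentationWindow:
    """Minimal model of the slide controls of the presentation window."""

    def __init__(self, slide_count):
        if slide_count < 1:
            raise ValueError("a presentation needs at least one slide")
        self.slide_count = slide_count
        self.current = 0                 # zero-based index of the slide shown
        self.presenter_control = False   # False means local control

    def press(self, button):
        """Apply one button press and return the index of the slide shown."""
        last = self.slide_count - 1
        if button == 31:      # return to the first slide
            self.current = 0
        elif button == 32:    # move back one slide, stopping at the first
            self.current = max(0, self.current - 1)
        elif button == 33:    # move forward one slide, stopping at the last
            self.current = min(last, self.current + 1)
        elif button == 34:    # go to the last slide
            self.current = last
        elif button == 35:    # toggle local control / presenter control
            self.presenter_control = not self.presenter_control
        else:
            raise ValueError(f"unknown button: {button}")
        return self.current


window = PresentationWindow(slide_count=5)
window.press(33)
window.press(33)
```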



Whiteboard media window 

According to this embodiment, Figure 14 is a block diagram of a
whiteboard media window 40 during an avatar user interface session.
The whiteboard 40 is controlled by sets of control buttons 41-43
when it is in the large window but cannot be operated when it is in a
smaller window. The session server 1 maintains the whiteboard content
as being identical on all client personal computers 3. The whiteboard
consists of multiple pages on which content can be created or pasted.
The analogy is that of a flip-chart which has multiple pages.

The set of control buttons 41 are similar in function to buttons 31 to
35 in the presentation window. They control which of the whiteboard
pages is displayed. There can be local control of the whiteboard
pages or control can be handed to the presenter by means of a mode
toggle key.

The set of control buttons 42 presents a palette of colours for the
person creating content to choose from. This is similar to the
palette in the Microsoft Paint application.

The set of control buttons 43 presents a collection of tools for
creating content. Examples include text mode, line drawing mode and
rubout mode. These tools are similar to those in the Microsoft Paint
application.

Meeting Room media window 

According to this embodiment, Figure 15 is a representation of an
example of a meeting room media window 50 during an avatar user
interface session. There are 5 participants on the session. Each
participant in the avatar user interface session is represented by
their avatar 5 sitting around a meeting table 51. In the background
is a screen 53 on which presentation slides are displayed, a
whiteboard 54 which can be written on by the participants and the room
comprising walls 55, ceiling 56, floor 57, door 58 with a door handle
59 and a windowpane 60. The avatars 5 shown in the meeting room media
window 50 are labelled Ted, Jill, Andy and Pam. The avatar 5 labelled
Pam is using a mobile phone 79. The avatar of Bert is not shown in
Figure 15. Bert is viewing the meeting room media window 50 and is
the fifth participant on the session. Bert does not see an avatar of
himself. There may be other items in the room such as plants, sky
visible through the windowpane 60, birds flying in the sky and trees
visible through the windowpane 60.

The user 17 arranging the conference may select from several designs
of meeting room 50 offered by an avatar conference service provider.
A selected meeting room 50 may be informal or formal. It may be large
or small. It may be designed to suit a particular culture, e.g.
Japanese.

The buttons 45-48 control the mode 84 in which the meeting room media
window operates. Button 45 selects mode M1. Button 46 selects mode
M2. Button 47 selects mode M3. Button 48 selects mode M4. The layout
button 85 controls the layout for modes in which the layout is an
option.

The meeting room media window in an avatar user interface session is
useful to different people at different times:

- if you have never physically met a person who is on the session, it
  is usually interesting to see their avatar to see what they look
  like
- when you come into the session, it is useful to visualise who is
  already there by seeing their avatars
- when someone arrives at or leaves the session, you can see who it
  is without the session being interrupted
- if you do not recognise the voice of the person speaking, you can
  see their avatar and their name label in the window

Video Conference Metaphor 

In a video conference from multiple locations, there is usually either
a split screen with a separate section in the monitor for each
location or a separate monitor for each location. Many people have
taken part in video conferences and are used to the Video Conference
metaphor in which each location and often each participant is seen in
a separate display section.

The main visual drawbacks of video conferencing are:

(a) that there is not a cohesive space for the meeting; each window
    is unrelated to the others
(b) that the patterns of gaze of the participants as seen in the
    monitors are disjointed; each participant tends to look in a
    different direction; this is at its worst in desktop video
    conferencing when webcams situated on top of personal computer
    monitors are used and the participant looks at the monitor and
    not at the webcam; this is unlike a real meeting in which the
    gaze of each participant has a function and there is a cohesive
    whole.

These two drawbacks significantly reduce the sense of copresence that 
video conferencing might offer and make the experience of 
participating in a video conference unsatisfactory. The concept of 
copresence has arisen comparatively recently and as yet there seems to 
be no commonly accepted definition of it. However, there is general 
agreement that where a high sense of copresence is experienced by 
users of the virtual environment, there are benefits varying from 
greater task efficiency to less distraction. 

According to this embodiment, Figures 16a, 16b, 16c and 16d are 
schematic diagrams to illustrate the virtual camera positions in the 
virtual video conference. Each is a plan view. Cameras 61, 62, 63 and
64 view avatars 5 labelled Ted, Jill, Andy and Pam respectively. 
Behind avatars 5 are four backgrounds 65, 66, 67 and 68. 

According to this embodiment, Figures 17a, 17b and 17c are schematics 
of three possible layouts in the meeting room media window 50. Layout 
1 shows the avatars 5 in a virtual room 69 sitting around a virtual 
table 51. Layout 2 shows the avatars 5 in a straight line 
arrangement. Layout 3 shows the avatars 5 in a split screen 
arrangement. The backgrounds 65, 66, 67 and 68 may be identical, 
similar or completely different depending on what works best for the 
selected Layout 1, 2 or 3. The layout is selected using layout button
85. 

Meeting Room Metaphor 

The Meeting Room media window 50 of the avatar conference is a
metaphor for an actual meeting that is being video-cast live. An
example might be a group discussion broadcast from a television
studio. By using photo-realistic 3D avatars, a photo-realistic 3D
meeting room, anima-realistic animations of the avatars and good
camera direction, it is possible to suspend the disbelief of the
viewer on the session such that he thinks it is an actual meeting
where he is the only person who is not in the room. This gives the
viewer a higher sense of copresence in the avatar user interface
session than is obtainable in a telephone conference call. The
objective is for the enactment to be so realistic that the viewer
finds it hard to tell the difference between the avatar conference and
a live video of the actual meeting room.

According to this embodiment, Figure 18 is a plan view of the virtual
meeting room illustrating possible virtual camera positions. Camera
71 is the overview camera and will show the view illustrated in Figure
15. Camera 71 is positioned at the eye position of the Avatar called
Bert who is seeing the Meeting room media window 50 in Figure 15 on
his personal computer 3. Cameras 72, 73, 74 and 75 view avatars 5
labelled Ted, Jill, Andy and Pam respectively. Camera 76 shows the
presentation screen 53. Camera 77 shows the whiteboard 54. Other
cameras may be positioned at any location and oriented at any
orientation.

Meeting Room media window modes 

There are four modes M1, M2, M3, M4 for the Meeting Room media window.
The user is free to select a preferred mode using the buttons 45-48.

In each mode, the view presented is from a virtual camera position.
In each mode there are one or more virtual cameras. A virtual camera
can have camera controls such as zoom and pan in addition to spatial
movement.

According to this embodiment, Figure 19 is a set of four timelines of
the camera shots during the avatar conference for each Mode. In Mode
M1, by way of example, there is only one shot S1 which lasts for the
duration of the avatar conference and is shot from Camera 71. In Mode
M2, by way of example, the first shot S10 is from Camera 71 and is an
overview view similar to that in Figure 15. This is followed by shot
S11 from Camera 72 which shows Ted. This is followed by shot S12 from
Camera 76 which shows the presentation screen. The avatar conference
timeline continues until the last shot S17 from Camera 71. In Mode
M3, by way of example, there is only one shot S20 which lasts for the
duration of the avatar conference and is in Layout 1 using Cameras
61, 62, 63 and 64. In Mode M4, by way of example, the first shot S30
is in Layout 1. This is followed by shot S31 from Camera 61 which
shows Ted against background 65. This is followed by shot S32 from
Camera 76 which shows the presentation screen. The avatar conference
timeline continues until the last shot S37 in Layout 1.
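
A shot timeline of the kind shown for Mode M2 can be represented as an
ordered list of shots from which the camera in use at any moment is
looked up. The start times below are invented for illustration, since
no shot durations are given, and the function name is hypothetical.

```python
from bisect import bisect_right

# First three shots of the Mode M2 example: (start time in seconds,
# shot label, camera number). The start times are assumed values.
M2_SHOTS = [
    (0.0, "S10", 71),   # overview
    (20.0, "S11", 72),  # Ted
    (35.0, "S12", 76),  # presentation screen
]


def shot_at(shots, t):
    """Return (label, camera) of the shot being shown at time t."""
    starts = [start for start, _, _ in shots]
    index = bisect_right(starts, t) - 1
    if index < 0:
        raise ValueError("time precedes the first shot")
    _, label, camera = shots[index]
    return label, camera
```

Under this representation a single-shot mode such as M1 is simply a
timeline with one entry, so the same look-up serves all four modes.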

M1 Meeting room; Overview

This mode M1 uses the Meeting Room metaphor. It gives an overview from
a single virtual camera of the table 51, all the avatars 5 around it,
the whiteboard 54 and the presentation screen 53. There are no other
cameras.



The viewer's avatar is not present. If the viewer's avatar were
present, then the viewer would see his own avatar animating and in
particular lip syncing whilst he talks, and the effect would be like a
mirror that reflects actions you do not make. Seeing your own avatar
breaks the metaphor and reduces the copresence felt by the viewer.
The camera viewpoint can be from where the viewer could be sitting at
the table or any other viewpoint that 'misses' the viewer's avatar.

M2 Meeting room; Chat show

This mode M2 uses the Meeting Room metaphor. Multiple cameras are 
used but there is only one window. The result is like a televised 
chat show with cuts from one camera to another as the chat develops. 

M3 Video conference; Overview

This mode M3 uses the Video Conference metaphor but improves on it to
partially overcome the drawbacks of a real video conference. The
Meeting Room media window 50 is laid out in sections and shows one
participant's avatar in each section of the window. Referring again to
Figure 15, one of the three layouts in Figures 17a, 17b and 17c can be
chosen by the user by toggling button 85. In all of the three
layouts, the gaze direction of the avatars can be controlled to
overcome the drawback of a video conference or desktop conference in
which the gaze direction of the participants is disconcerting to the
user.



Layout 1 helps to give a sense of cohesive space for the video 
conference in that the layout is enhanced with a virtual room 69 which 
can include all items shown in Figure 15 including a virtual table 51. 

Layout 2 goes half way to providing a sense of cohesive 3D space by
putting the avatars in a line but does not include a virtual room and
virtual table.

Layout 3 is a split screen layout that maximises the display 
resolution per participant and is useful where there are a large 
number of participants in the avatar conference. 

M4 Video conference; Chat show 

This Mode M4 uses the Video Conference metaphor. Multiple cameras are 
used but there is only one window. The result is like a televised 
multi-location show with each participant in a different location with 
cuts from one camera to another as the chat develops. 

Activities during an Avatar Conference 

The avatar user interface system is used in a variety of ways. The 
following is a list of collaborative meeting activities, with an
indication of the percentage of meeting time devoted to each
activity type when averaged over a wide variety of meeting types.



Discussion (no media)                  59%
Presentation (slides, images ...)      27%
Discussion (white board)                5%
Shared application (eg Word, Excel)     4%
Watch video clip                        3%
Listen to audio clip                    1%
View 3D virtual object                  1%



In addition to the collaborative meeting activities, individuals or
small groups can perform other activities. These include:

- Text chat
- Whispering (private voice connection)
- Break-out meeting
- Preparing a whiteboard sheet
- Using an on-line translation service
- Multi-tasking with non-meeting activity, e.g. reading, doing e-mail

Conference types

There are a variety of different conference types, agendas and 
objectives. Designers may discuss a 3D object. Advertising people 
may listen to radio adverts or view prototype packaging images, video 
clips of TV adverts. Businessmen may view a slide presentation. 
Salesmen may present new products. Students may take part in an 
e-learning course led by a tutor or they may work collaboratively 
together. 

Different user interface displays 

It is a purpose of this embodiment that the graphical display of the 
avatar user interface varies according to the computing appliance 
capabilities, the type of conference being held and user preference. 
Figure 12 is just one example of an avatar conference display. This 
invention is not limited to the one example shown in Figure 12. 

Events during an Avatar Conference 

The avatar conference is a series of events. The events are largely 
unscripted, although there is often an agenda and a Chairman whose 
objective is to ensure that the meeting follows the agenda. The 
following events are listed by way of example only and do not form a 
comprehensive list of all events that can take place in an avatar 
conference: 

1. Person joins the conference 
2. Person leaves the conference 
3. A person stops speaking 
4. A person starts speaking 
5. Two or more people speak simultaneously 
6. A presentation slide is projected 
7. The presentation projector is turned off 
8. A video is shown 
9. A new whiteboard sheet is drawn on 
10. A previous whiteboard sheet is turned to 
11. A camera shot times out and a new camera shot begins 
12. Move onto a new agenda item 
13. Write a minute of a point just discussed 

Most of these events are generated as a result of the action or 
inaction of a participant in the conference as detected by input 
mechanisms such as keyboard, mouse and microphone into the avatar 
conference system. 



Software director 

According to this embodiment, Figure 20 is a block diagram of a 
software director 80 and an avatar player engine 210. The flow of 
events 81 into a software director finite state machine 80 is shown 
with the resulting flow of camera shots 82, light settings 214 and 
actions 83 such as avatar animations into an avatar player engine 210. 

The avatar player engine 210 also uses at least one avatar 5, the 
scene 211, props 215 and the lighting model 212 to combine with the 
shots 82, light settings 214 and actions 83 to generate and display 
the avatar conference on the avatar session user interface 10. A 3D 
graphics processor chip 213 is often used in the personal computer 3. 

Since no physical meeting room exists, the avatar conference can be 
enacted with each event being acted out by an avatar. The enactment 
of the avatar conference can be shown from multiple camera viewpoints 
and camera movements such as translation, zoom and pan. 

It is a purpose of this embodiment of the invention that a software 
director 80, which is a finite state machine, directs the enactment 
and visualisation of the meeting in the avatar conference media window 
by reacting to the events 81 as they occur during the meeting. 

The software director 80 takes into account the mode 84 and layout 85. 
A library of actions 87 is available. An action generator 88 is 
available. These actions are animations for avatars. Action 
impersonation parameters 332 from at least one avatar 5 are available. 
In addition, timers 86 are started after some actions and new events 
are triggered by timers 86 expiring. 

The software director finite state machine 80 is effectively a 
software agent that initiates actions triggered by events according to 
rules. In a constrained activity such as an avatar conference, it is 
quite feasible to completely define all the events, all the actions 
and the set of rules for actions being generated by events. 

In addition to fixed rules, some actions are generated randomly. The 
generation of random actions such as camera cuts and avatar gestures 
can make the avatar conference more realistic and less predictable to 
the viewer. 
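
By way of illustration only, a rule-driven director of this kind can 
be sketched in Python. The rule table, the event names and the shot 
names below are hypothetical placeholders chosen for the sketch and are 
not taken from the disclosure; only the idea of fixed rules plus a 
random camera cut on a shot time-out follows the text above. 

```python
import random

# Hypothetical rule table: event name -> list of (output kind, detail).
RULES = {
    "person_joins": [("action", "walk_to_chair"), ("shot", "wide")],
    "person_starts_speaking": [("shot", "close_up_speaker"),
                               ("action", "talk_gestures")],
    "slide_projected": [("light", "dim"), ("shot", "presentation_screen")],
    "projector_off": [("light", "full"), ("shot", "wide")],
    "shot_timeout": [("shot", "random")],
}

# Hypothetical set of camera shots available for random cuts.
SHOTS = ["wide", "close_up_speaker", "over_shoulder"]


class SoftwareDirector:
    """Maps incoming events to shots, light settings and actions."""

    def __init__(self, rules, rng=None):
        self.rules = rules
        self.rng = rng or random.Random()
        self.current_shot = "wide"

    def handle(self, event):
        outputs = []
        for kind, detail in self.rules.get(event, []):
            if kind == "shot":
                if detail == "random":
                    # Random cut: pick any shot other than the current one.
                    choices = [s for s in SHOTS if s != self.current_shot]
                    detail = self.rng.choice(choices)
                self.current_shot = detail
            outputs.append((kind, detail))
        return outputs
```

An event with no rule simply produces no output, which keeps the 
sketch tolerant of events that have not yet been given rules. 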

In generating actions for an avatar 5, the software director 80 takes 
into account the action impersonation parameters 332 of that avatar 5. 
In this way, the actions 83 generated for that avatar 5 can be more 
characteristic of the user 17 that the avatar 5 represents. 

For example, if the action impersonation parameter 332 for gestures 
whilst talking 404 indicates a lot of gestures, then the software 
director 80 will generate actions 83 involving a lot of arm movement. 
In a similar way, if the action impersonation parameter 332 for lip 
synchronisation whilst talking 406 indicates very little lip movement 
whilst talking, then the software director 80 will generate actions 
83 for lip synchronisation involving very little lip movement. Rules 
for the five other disclosed action impersonation parameters 332 [400, 
401, 402, 403 and 405] may be drawn up in a similar way and for any 
other action impersonation parameters 332 that are defined and used. 

It is also possible to use other software agent approaches to make the 
conference realistic; one example is fuzzy logic. 

Animation player engine 

The scene 211 is typically that of a room as illustrated in Figure 15. 
Each item in the scene is modelled in 3D. To achieve a close to video 
experience that encourages a sense of presence, each item is made of 
photo-realistic textures as well as a 3D topology. Props 215 are 3D 
items in the scene that can be moved by the avatars or under 
self-power. Props 215 are modelled in a similar way to the scene. 

A lighting model 212 is used. The light levels 214 of the lights in 
the lighting model 212 can be changed by the software director 80 in 
reaction to events during the avatar conference. 

The visual aspect of the avatar conference is a collection of 3D 
content including multiple avatars, a scene, props and a lighting 
model. If rendering effects such as shadows are required, the 
complexity increases. This can place a large load on the personal 
computer 3. More and more often, a powerful 3D graphics processing 
chip 213 is built into the personal computer 3. In this way, it is 
possible for the avatar conference to achieve an acceptable frame rate 
such as 15-25 frames per second. 

Event accumulator 

According to this embodiment, Figure 21 is a block diagram of events 
on a personal computer 3 and a session server 1. It illustrates the 
event accumulator 89 on the session server 1 that gathers events 81 
and sends the accumulated events 81 to the software director 80 on 
each personal computer 3 via a network 2. A software director 80 can 
also generate events 81 and send them to the event accumulator 89. 

The event accumulator 89 on the session server 1 receives events 81 
from a variety of sources including: 

- Software director 80 on personal computers 3 
- Text chat software 
- Agenda manipulation software 
- Login software 
- Slide presentation software 
- Whiteboard software 
- Lip sync generation 

The session management software 228 manages one or more user interface 
sessions on the session server 1. 



Session and Hosting Payment 

Referring again to Figure 9, billing software 237 on the avatar 
hosting server 4 monitors aspects of the use of avatars such as the 
number of avatars hosted for a customer and arranges billing according 
to the revenue model agreed with the customer. As is appreciated by 
those skilled in billing, the billing software 237 is not limited to 
the functionality described above. For instance, the billing software 
237 could monitor other aspects of the sessions, it could apply 
different revenue models to different customers, it could use 
micro-payments for immediate debiting during use, it could combine 
billing for sessions, billing for avatar hosting and billing for other 
services, and it could be resident on any computer or server. 

Plurality of meeting room arrangements and enactments 

According to this embodiment, Figures 22a, 22b, 22c, 22d and 22e are 
schematics of the five seating plans viewed by the five participants 
in the avatar conference in their five meeting room media windows 50. 
The table 51 and the presentation screen 53 are the same in each view. 
Each participant's avatar is abbreviated to the first letter of its 
name: B for Bert, T for Ted, J for Jill, A for Andy and P for Pam. In 
each view one avatar is not shown: the avatar of the viewer. In 
effect the seating plan is rotated with reference to the presentation 
screen 53 for each of the five views. Each meeting room arrangement 
is therefore different. Other seating arrangement rules can be drawn 
up. 

Since the arrangements are different for each viewer, the enactments 
will also be different. If Ted's avatar enters the virtual room 
through the door 58, then in Figure 22a, Bert's view, he will have to 
walk to the far chair and sit down, whilst in Figure 22d, Andy's view, 
Ted will sit down at the chair nearest him. 
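
As a minimal sketch of one possible seating arrangement rule, the 
following Python function rotates a fixed cyclic order of participants 
so that the viewer is removed and the remaining avatars keep their 
relative order. The cyclic order and the function name are assumptions 
made for the sketch; the actual seat assignments of Figures 22a to 22e 
are not reproduced here. 

```python
# Assumed cyclic order of participants around the table:
# Bert, Ted, Jill, Andy, Pam.
PARTICIPANTS = ["B", "T", "J", "A", "P"]


def seating_plan(viewer, participants=PARTICIPANTS):
    """Return the avatars shown to `viewer`, in seat order.

    Assumed rule: rotate the cyclic order so that the viewer comes
    first, then drop the viewer, whose own avatar is not shown.
    """
    i = participants.index(viewer)
    rotated = participants[i:] + participants[:i]
    return rotated[1:]
```

Under this assumed rule each of the five viewers receives a different 
arrangement of the same four remaining avatars. 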

Each viewer only sees his representation of the virtual meeting room 
and does not see the representations of the virtual meeting room on 
other participants' personal computers 3. The meeting participants 
will not be aware of this unless they hold discussions along the lines 
of "Jill, who is sitting on your left." Therefore there should not be 
any confusion stemming from different meeting room arrangements and 
enactments. 



It is a further purpose of this embodiment that each meeting room 
window 50 displayed on each Personal Computer 3 can show a different 
representation of the virtual meeting room and a different enactment 
of the meeting. 

Audio mixer 

According to this embodiment, Figure 23 is a schematic of the audio 
mixer 90. It illustrates the audio mixer 90 that is part of the 
session server 1 and includes a balance system 204 and a filter system 
205. N audio input streams 91 arrive at the session server 1 from the 
personal computers 3 over the network 2. One audio input stream 91 
arrives from each personal computer 3. In addition, one or more audio 
input streams 92 might be available; audio input streams 92 can be 
generated from playing a media object during the avatar conference 
such as an audio or video clip or as streaming media channels coming 
in over the network 2; an audio input stream 92 might be voice, music, 
radio, TV or any other audio stream. N audio output streams 93 are 
generated by the audio mixer 90 and sent to the N personal computers 3 
over the network 2. 

The audio mixer 90 is a finite state machine that follows one main 
rule in the case of a conference where there is a single conversation 
common to all participants: the audio output stream 93 going to a 
personal computer 3 is a mix of one media object audio stream 92 and 
all the input audio streams 91 except for the one coming from that 
personal computer. Audio mixing in the audio mixer 90 is digital and, 
as will be clear to those skilled in the art, is carried out by 
combining synchronised time segments such that the real time of each 
input segment from each participant is the same. 
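
The main mixing rule can be illustrated with a short Python sketch. 
It assumes that each stream is already a time-aligned list of integer 
samples of equal length; the function name and the dictionary layout 
are hypothetical, and clipping of the summed samples is deliberately 
left out. 

```python
def mix_minus(inputs, media=None):
    """Build one output stream per personal computer.

    inputs: dict mapping a personal computer id to its list of
            time-aligned samples (audio input streams 91).
    media:  optional list of samples for one media object stream 92.
    Each output is the media stream plus every input except the
    listener's own input.
    """
    n = len(next(iter(inputs.values())))
    media = media if media is not None else [0] * n
    # Sum all inputs once, then subtract the listener's own samples.
    total = [sum(stream[k] for stream in inputs.values()) for k in range(n)]
    return {
        pc: [total[k] - own[k] + media[k] for k in range(n)]
        for pc, own in inputs.items()
    }
```

Summing once and subtracting each listener's own samples avoids 
recomputing a separate sum for each of the N outputs. 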

Amplitude balancing 

The audio mixer is also able to carry out an amplitude balancing 
function using the balance system 204, which reduces the amplitude of 
loud input audio streams 91 and increases the amplitude of quiet input 
audio streams 91 before mixing. In this way participants do not need 
to concentrate hard to hear quieter participants and do not get 
shocked by louder participants. 
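
One simple way of balancing amplitudes, shown here only as a sketch, 
is to scale each input segment to a common root-mean-square level 
before mixing. The target level, the silence threshold and the 
function names are assumptions for the example; the disclosure does not 
specify how the balance system 204 measures loudness. 

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def balance(streams, target_rms=0.1, silence_floor=1e-4):
    """Scale every stream to the same RMS level before mixing.

    Streams below `silence_floor` are left untouched so that silence
    and faint background noise are not amplified.
    """
    balanced = {}
    for pc, samples in streams.items():
        level = rms(samples)
        gain = 1.0 if level < silence_floor else target_rms / level
        balanced[pc] = [s * gain for s in samples]
    return balanced
```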

Audio filtering 

The audio mixer is also able to carry out a filtering function using 
the filter system 205, which filters the input audio streams 91 to 
reduce annoying sound artefacts generated by the mixing process or by 
lags in the network 2. In this way participants enjoy a cleaner and 
higher quality audio experience during the conference. 

Whispering, groups and other sessions 

It is often the case in a conference that the meeting splits into 
smaller groups, each of which holds a separate conversation. 

It is also a desirable feature that two or more people can whisper 
together whilst the main conference conversation proceeds without 
distracting the other participants. This is a case where the 
functionality of an avatar conference can be superior to that of a 
physical conference in which people whispering together is often a 
distraction to the other participants. In a physical conference, 
participants who are whispering can hear both the main conference 
conversation and their whispered conversation. 

Visual feedback in the meeting room media window 50 can be provided to 
participants showing who is whispering and who has split into a 
smaller group. A simple way is for the avatars 5 of those whispering 
to automatically get up and move to the back of the room where they 
can be seen chatting together by others (but not heard). The same 
approach of forming a standing group can be used for small groups. 
For a formal break-out group, another meeting room can be used. So as 
not to lose visual continuity, the additional meeting room can be 
situated behind the wall 55 which can be made of glass like a large 
window 60 and the avatars in the additional meeting room can be 
visible through the glass wall 55. 

For the case where a user 17 is involved in a session completely 
separate from the conference, this can be represented by his avatar 5 
using a mobile phone 79. This conveys to the other participants that 
a user 17 whose avatar 5 is holding a mobile phone 79 does not have 
his full attention on the conference. 

Multiple conversations 

According to this embodiment, Figure 24 is a schematic of the audio 
mixer 90 for multiple conversations. It illustrates the audio mixer 
90 that is part of the session server 1 when more than one 
conversation is taking place simultaneously during the conference. 
There are 3 conversations taking place: Conversation1 201, 
Conversation2 202 and Conversation3 203. Conversation1 201 in the 
CONV1 mixer 94 uses the input and output streams 1, 2 and 3. 
Conversation2 202 in the CONV2 mixer 95 uses the input and output 
streams 4 and 5. Conversation3 203 in the CONV3 mixer 96 uses the 
input and output streams 6, 7 and 8. The mixed output 97 of the CONV1 
mixer 94 is also fed into the CONV2 mixer 95. The CONV2 mixer 95 is 
set up to combine Conversation1 and Conversation2 such that the output 
streams 4 and 5 include both Conversation1 201 and Conversation2 202 
but the output streams 1, 2 and 3 do not include any element of 
Conversation2 202. 
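
The routing in this example can be sketched in Python as follows. 
The data layout and the names `route`, `conversations` and `feeds` are 
hypothetical. The sketch also assumes, as in the single-conversation 
rule, that a listener's own input is left out of the listener's output, 
and that every listener hears at least one stream. 

```python
def mix(streams):
    """Sum time-aligned sample lists sample by sample."""
    return [sum(column) for column in zip(*streams)]


def route(inputs, conversations, feeds):
    """Compute one output per stream number.

    inputs:        stream number -> list of samples.
    conversations: conversation name -> list of member stream numbers.
    feeds:         conversation name -> names of other conversations
                   whose mixed output is also fed into its mixer.
    """
    outputs = {}
    for name, members in conversations.items():
        overheard = [inputs[s]
                     for other in feeds.get(name, [])
                     for s in conversations[other]]
        for listener in members:
            own_group = [inputs[s] for s in members if s != listener]
            outputs[listener] = mix(own_group + overheard)
    return outputs


# The arrangement of Figure 24: CONV1 output is also fed into CONV2.
CONVERSATIONS = {"CONV1": [1, 2, 3], "CONV2": [4, 5], "CONV3": [6, 7, 8]}
FEEDS = {"CONV2": ["CONV1"]}
```

With this configuration, streams 4 and 5 hear both Conversation1 and 
Conversation2, while streams 1, 2 and 3 hear only Conversation1. 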

It is a further purpose of this embodiment that the audio mixer 90 can 
be configured to support two or more conversations simultaneously. In 
addition, it is possible to combine the main conference conversation 
with whispering such that two conversations can be heard 
simultaneously. 

It is also possible to combine a conversation with a digital audio 
stream 92 playing for example music so that both the music and the 
conversation can be heard simultaneously. 

Lip Synchronisation Generator 

According to this embodiment, Figure 25 is a block diagram of a Lip 
Sync Generator (LSG) 100 in which the microphone 12 receives voice 270 
from a user 17 and background noise 271 from an environmental location 
273. The resulting analogue audio stream 103 generated by the 
microphone 12 is processed by a standard sound card 102 such as a 
Sound Blaster from Creative Technologies Inc (USA) that is in the 
personal computer 3. The digital output 104 from the sound card 102 is 
input into the LSG 100 which first reduces the background noise 271 
with a filter 205 and then outputs a stream of geometric positions 
101. In addition, a digital audio transform stream 105 is output from 
the LSG 100. The digital audio transform stream 105 can also be the 
same as the input audio stream 91 to the audio mixer 90. A stream of 
events 81 is also output by the LSG 100 which travels over a network 2 
to the event accumulator 89 on the session server 1. 



According to this embodiment, Figure 26 is a timeline of a lip sync 
generator. It illustrates that the processing in the LSG takes time 
and the output 101 lags the input 104 by time T milliseconds. 

Lip sync animation types 

According to this embodiment, Figures 27a, 27b, 27c and 27d are 
diagrammatic representations of four geometric values for four lip 
sync animation types that can be used to animate a talking head 111 
with a mouth 112. In Figure 27a, the jaw rotation angle B is the 
angle between the jaw 107 and the upper teeth 106 and is the first 
geometric value that can be output from the LSG. In Figure 27b, the 
mouth length L is the distance between the two corners of the mouth 
109. In Figure 27c, the lip rotation angle A is the angle between the 
angle of the teeth 106 and the angle of the lip 108. In Figure 27d, 
the tongue protrusion length P is the length of protrusion of the 
tongue 110 from its rearmost position. 

Voice processing 

In this embodiment, the microphone 12 records sound from the person 17 
speaking in the conference. Human voice is typically audible in the 
range 20 Hz to 20 kHz. The analogue signal 103 from the microphone 12 
is processed to produce a digital audio stream 104 sampled at 16 kHz 
and 16 bits resolution by the sound card 102 in the personal computer. 
Sampling at 16 kHz and 8 bits was tried but the data was too sparse to 
allow the LSG to perform well in this particular avatar conference 
configuration. The output 101 from the LSG 100 is four real numbers, 
one for each geometric value of a lip sync animation type at a sample 
rate of 30 per second. 



According to this embodiment, Figure 28 is a flow diagram of the 
process followed by the LSG 100. The digital audio stream data 104 
flows into a buffer 120. At regular intervals, a discrete Fourier 
transform 121 is performed on the audio data accumulated in the buffer 
120 and a spectrum 146 is output. The spectrum 146 comprises a finite 
number of bins representing frequency ranges with the degree to which 
each bin is filled defining the amplitude of that frequency range. A 
jaw rotation analyser 123 outputs a value representing the jaw angle 
124. A mouth length analyser 125 outputs a value representing the 
mouth length 126. A lip rotation analyser 127 outputs a value 
representing the lip angle 128. A tongue protrusion analyser 129 
outputs a value representing the tongue protrusion 130. One or more 
emotion analysers 135 output strengths of emotion 136. The stream of 
spectrums 146 generated is the audio transform stream 105. The 
combination of real numbers 124, 126, 128 and 130 in a stream is the 
geometric position stream 101 in which one or more strengths of 
emotion 136 are included. The audio transform stream 105 is compressed 
131 to produce a compressed audio stream 132 and this is combined 133 
with the geometric positions 101 to form a stream of packets 134 for 
transfer over the network 2 to the session server 1. 

In order that lip sync animation can take place on any head, the 
geometric values 124, 126, 128 and 130 for the four lip sync animation 
types are each normalised and output by the respective analysers 123, 
125, 127 and 129 in the range 0 to 1.0. 

The LSG 100 operates at 62.5 Hz in that a discrete Fourier transform 
is performed on the digital audio stream data 104 accumulated during 
the previous 0.016 sec. The frequency spectrum is divided into 128 
bins representing frequency ranges. The packets sent over the network 
are sent at a frequency of 30 Hz. Operation at rates in excess of 100 
Hz was tried, but the LSG quality deteriorated due to a reduction in 
signal. These values are settings at which the LSG 100 works, but 
this invention is not limited to these precise settings and includes 
all settings that work for this process. 
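
The arithmetic behind these settings can be checked with a short 
Python sketch. At 16 kHz, a buffer of 0.016 sec holds 256 samples, and 
a 256-point discrete Fourier transform of real audio yields 128 
positive-frequency bins of 62.5 Hz each when the DC term is left out. 
Whether the LSG 100 discards the DC term in this way is an assumption, 
and the plain, unoptimised transform below is for illustration only. 

```python
import cmath
import math

SAMPLE_RATE_HZ = 16_000   # sampling rate stated in the text
WINDOW_SEC = 0.016        # buffer length stated in the text
NUM_BINS = 128            # number of bins stated in the text


def spectrum(samples):
    """Amplitude of bins k = 1..NUM_BINS of a plain DFT (DC skipped)."""
    n = len(samples)
    bins = []
    for k in range(1, NUM_BINS + 1):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        bins.append(abs(acc) / n)
    return bins


window = round(SAMPLE_RATE_HZ * WINDOW_SEC)   # samples per buffer
bin_width_hz = SAMPLE_RATE_HZ / window        # width of one bin

# A 1 kHz test tone should land in bin number 1000 / 62.5 = 16.
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE_HZ)
        for t in range(window)]
spec = spectrum(tone)
peak_bin = 1 + max(range(NUM_BINS), key=spec.__getitem__)
```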
Audio compression 

The audio compression 131 in which the stream of spectrums 105 is 
further compressed can be carried out by any of the 
compression-decompression routines known to experts in the field. 

Many methods of compression of audio streams include carrying out 
discrete Fourier transforms as one step in the compression. It is an 
advantage of this invention that a single discrete Fourier transform 
is used for two purposes: lip synchronisation generation and audio 
compression. This invention requires less of the personal computer 
processing power than methods in which lip sync generation and audio 
compression are performed in separate processes. 

Audio lag 

According to this embodiment, Figure 29 is a flow diagram illustrating 
the steps involved in the passage of a sound from the microphone 12 on 
one personal computer 3, through the sound card 102, processed by the 
LSG 100, sent to the session server 1 over the network 2, buffered and 
mixed in the audio mixer 90, resent over the network 2 to another 
personal computer 3, buffered and decompressed 140 and played on the 
headphones 13 via the sound card 102. 

The geometric and audio information in the packets is for the same 
period of time; in other words there is no lag within the packet 
between the geometric and audio information. This has the advantage 
of perfect timing on the lip synchronisation on replay. There is also 
the advantage of simplicity in that the two data types are combined 
in the same packet. 

But there is a lag for the whole system in that, as shown in Figure 
29, the sound passes through many stages from when it is spoken into 
the microphone 12 to when it is heard in headphones 13. The largest 
element of lag may be a different element in each system design. If 
the network 2 is the internet then the lag caused by the internet 
could be in excess of 1 second. The greatest acceptable lag in 
tele-conversations is around 200-300 milliseconds although longer lags 
of 500 milliseconds or more are considered acceptable by users on 
mobile phone networks. The lag in the LSG between the digital audio 
input 104 and the outputs 101 and 105 is typically in the range 
T = 0.05 to 0.1 sec. 

Geometry from spectrum 

According to this embodiment, Figure 30a is a spectrogram 145 and 
Figure 30b is a graphical diagram of a spectrum 146 from time t on the 
spectrogram 145. The spectrum 146 comprises a finite number of bins 
representing frequency ranges with the degree to which each bin is 
filled defining the amplitude of that frequency range. For clarity in 
disclosing this embodiment, the spectrum 146 is segmented into just 7 
bins corresponding to rows f1 to f7. As already disclosed, the number 
of bins in a spectrum 146 is likely to be much higher. The row f1 
corresponds to the lowest frequencies collected and the row f7 
corresponds to the highest frequencies collected, with f2-f6 covering 
frequency ranges in between. For clarity in disclosing this 
embodiment, the amplitude of each bin is split into ranges a1 to a6. 
The range a6 covers the largest amplitudes. When 
displayed on a colour screen, the spectrogram 145 can be depth encoded 
in discrete colours such that the square in row f1 is coloured with a 
colour signifying amplitude a1, the square in row f2 is coloured with 
a colour signifying amplitude a4 and so on for rows f3 to f7 of the 
spectrum 146. In practice, the amplitude is likely to be stored as a 
floating point number and only split into amplitude ranges for the 
purposes of visualisation on the colour spectrogram 145. 

It was appreciated that the generation of geometric values for facial 
animation during speech is not a perfect science. Each person's voice 
pattern is unique and so is their facial animation during speech. For 
a real-time LSG to be useful, it must generate facial animation that 
is acceptable to the user. 

It was also appreciated that restricting the facial animation to four 
geometric parameters was a simplification reducing the representation 
of something as complicated as movements of a human face during speech 
to four values. It was foreseen that if the system worked well for 
four geometric parameters, then it might be improved by adding further 
geometric parameters and that each parameter may be defined 
differently from the definition disclosed here. 

The approach taken to generating acceptable facial geometry for four 
geometric parameters from a single spectrum was an experimental and 
analytical approach. Software visualisation tools were developed for 
showing the colour spectrograms of voice as it was recorded through a 
standard microphone supplied with a low-cost headset. In the end a 
range of 64 amplitudes with 64 colours was chosen for visualising the 
shape of utterances on the spectrogram and used in the rules for 
determining geometry from the spectrum. Internally, the amplitude is 
a floating point real number. 

Most work analysing voice has been by researchers coming from the 
voice recognition or voice synthesis communities. Their approaches 
have been strongly linked to concepts such as phonemes, visemes, 
diphones and co-articulation. As seen in the disclosure of this 
patent, the LSG 100 has a direct route between voice spectrum and the 
geometry output without attempting to go through intermediate concepts 
such as phonemes, visemes, diphones and co-articulation. 

Other LSG attributes 

One requirement for the LSG is for it to scale to non-speech 
utterances such as singing and laughing. The need is for the avatar 
to visually represent those utterances in an acceptable manner. 
Another requirement for the LSG is for it to work with different 
people's voices and all languages. A further requirement is for the 
software code to be small enough to download from the session server 1 
over a network 2 to the client personal computer 3 without too long a 
delay. 

LSG approach and algorithms 

The approach involved creating spectrograms of simple sounds and 
recording the corresponding facial geometry made whilst speaking those 
sounds. The spectrums in the spectrograms were then studied to look 
for patterns that could be transferred into heuristic algorithms. 
These algorithms were then installed in the jaw, mouth, lip and tongue 
analysers 123, 125, 127 and 129. Once the system was working for 
simple sounds with algorithms in place in the analysers, it was tested 
with more complex words and different voices. Whenever the facial 
animation was found to be unacceptable the algorithms were adjusted or 
new algorithms developed to improve the facial animation. 

In this embodiment, simple algorithms are disclosed for the analysers 
that work to an acceptable level on a variety of voices, languages and 
adequately for some singers. It is appreciated that these algorithms 
can be improved upon and this is a target of future research work. 

The algorithm in the jaw rotation analyser 123 relates the output jaw 
angle to the energy in the high frequency bins. In general, whilst 
talking, the mouth opens further when making high frequency sounds 
than low frequency sounds. In the jaw rotation analyser 123, the 
higher the amplitude in the high frequency bins, the larger the jaw 
rotation and the more the mouth is open. The algorithm in the jaw 
rotation analyser 123 calculates a normalised average value 124 of the 
sum of the normalised amplitudes in the high frequency ranges f5, f6 
and f7. This algorithm in the jaw rotation analyser 123 can be 
improved by setting a minimum level of mean normalised amplitude in 
the high frequency ranges f5, f6 and f7. If the actual mean 
normalised amplitude is not above this minimum level then the output 
value 124 is set to zero. This stops the mouth opening in response to 
low levels of background noise rather than speech. 
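
A minimal Python sketch of this heuristic, using the simplified 
seven-bin spectrum f1 to f7, is given below. The function name and the 
numerical value of the minimum level are assumptions; the text states 
only that such a minimum level exists. 

```python
def jaw_angle(norm_amplitudes, high_bins=(4, 5, 6), min_level=0.1):
    """Normalised jaw angle (0..1) from a normalised spectrum.

    norm_amplitudes: per-bin amplitudes already normalised to 0..1,
                     with f1 at index 0 and f7 at index 6.
    high_bins:       indices of the high frequency bins f5, f6 and f7.
    min_level:       assumed noise threshold below which the mouth
                     stays closed.
    """
    mean_high = sum(norm_amplitudes[i] for i in high_bins) / len(high_bins)
    return mean_high if mean_high > min_level else 0.0
```

Because the inputs are already normalised to the range 0 to 1, their 
mean is also in that range and needs no further scaling. 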

The algorithm in the mouth length analyser 125 works on frequency 
range. The wider the range of frequencies, the larger the length 
between the mouth corners 126. The standard deviation of the spectrum 
is calculated from the amplitudes in each bin in the spectrum. The 
mouth length 126 output by the mouth length analyser 125 is 
proportional to this standard deviation. The mouth length 126 is a 
normalised value from 0 to 1. Whistling is an extreme example in 
which the mouth length 126 is very short to make a small hole through 
which air is expelled at a focused frequency. The mouth length 
analyser 125 can handle whistling because the standard deviation of a 
whistling sound is very small and the output mouth length 126 is 
correspondingly small. 
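
One reading of this heuristic, offered as an assumption rather than as 
the disclosed algorithm, is to treat the bin amplitudes as weights and 
compute the standard deviation of the bin position, so that a single 
narrow peak (such as a whistle) gives a value near zero and energy 
spread over many bins gives a larger value. The normalisation by the 
largest possible spread is likewise an assumption. 

```python
import math


def mouth_length(amplitudes):
    """Normalised mouth length (0..1) from the spread of the spectrum.

    amplitudes: non-negative amplitude of each frequency bin.
    The spread is the amplitude-weighted standard deviation of the bin
    index, divided by its largest possible value (n - 1) / 2.
    """
    n = len(amplitudes)
    total = sum(amplitudes)
    if n < 2 or total <= 0:
        return 0.0
    mean = sum(i * a for i, a in enumerate(amplitudes)) / total
    variance = sum(a * (i - mean) ** 2
                   for i, a in enumerate(amplitudes)) / total
    return min(1.0, math.sqrt(variance) / ((n - 1) / 2))
```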
The lip rotation analyser 127 looks for high amplitudes at particular 
frequencies. Lip rotations are associated with plosive sounds such as 
's' or 't' that are in effect sudden bursts of energy at a 
characteristic frequency range. Each plosive sound has a 
characteristic frequency bin or set of neighbouring bins. The higher 
the relative amplitude at one characteristic frequency, the larger the 
lip rotation. The lip rotation analyser 127 checks for high amplitude 
at one of these known sets of frequency bins relative to all the other 
frequency bins. The lip rotation 128 output by the lip rotation 
analyser 127 is proportional to the ratio between the average 
amplitude of the set of characteristic bins and the average amplitude 
of all the other frequency bins. The lip rotation 128 is a normalised 
value from 0 to 1. 
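
The ratio described above can be sketched in Python as follows. The 
characteristic bin sets passed in and the ratio at which the output 
saturates (`full_ratio`) are hypothetical values chosen for the sketch; 
the text does not give the constant of proportionality. 

```python
def lip_rotation(amplitudes, characteristic_sets, full_ratio=8.0):
    """Normalised lip rotation (0..1).

    amplitudes:          non-negative amplitude of each frequency bin.
    characteristic_sets: iterable of tuples of bin indices, one tuple
                         per known plosive sound.
    full_ratio:          assumed ratio that maps to full rotation.
    """
    best = 0.0
    for bins in characteristic_sets:
        inside = [amplitudes[i] for i in bins]
        outside = [a for i, a in enumerate(amplitudes) if i not in bins]
        if not inside or not outside:
            continue
        mean_in = sum(inside) / len(inside)
        mean_out = sum(outside) / len(outside)
        if mean_out <= 0:
            # All energy is in the characteristic bins (or none at all).
            ratio = full_ratio if mean_in > 0 else 0.0
        else:
            ratio = mean_in / mean_out
        best = max(best, ratio)
    return min(1.0, best / full_ratio)
```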

The tongue protrusion analyser 129 looks for characteristic sounds 
such as 'th' in which the tongue protrudes. The higher the amplitude 
of the characteristic sound, the more the tongue protrudes. 

Emotion detection 

It is useful to detect the emotion of a person from the person's 
voice. Once detected, the emotion can be used to modify the avatar's 
actions such that the avatar's visual behaviour matches the emotion 
conveyed by the audio. Some emotions engender large changes in body 
language and other emotions engender barely noticeable changes in body 
language. For a good avatar metaphor it is useful to detect emotions 
that engender large changes in body language. 

Referring again to Figure 28, the simplest emotion to detect is the 
absence of speech over time. This can be detected by a special 
30 emotion analyser 135 designed to detect absence of speech that outputs 
a strength of speech 136. If the strength of speech 136 is zero, then 
there is no speech at that time t. If the strength of speech 136 is 1 
then there is speech. 

An emotion that engenders large body movement is laughing. Laughing
has a characteristic pattern that can be detected from speech. There
is a regular pattern of sounds at a frequency of around 3-4 Hz along
the time axis in the spectrogram 145 and characteristic high amplitude
such as levels a4-a6 and a low frequency such as f5-f7 in the spectrum
146. Laughing can be detected by a special emotion analyser 135
designed to detect laughing. The strength of laughing 136 output is a
normalised value in the range 0 to 1.

Anger can be detected by an increase in amplitude. This is not always
reliable, because, for example, moving the microphone 12 closer to the
mouth of the user 17 may result in a significant increase in
amplitude.

It is a further purpose of this embodiment that emotions be detected 
from the audio signal of a person speaking in near real-time and that 
the detected emotions be used to modify the movements of the avatar 
representing that person. 

Geometry damping 

It was found that the raw streams of real numbers from all of the
analysers 123, 125, 127, 129 and 135 were noisy. The values went
through substantial fluctuation from one spectrum analysis to the
next. This gave poor facial animation in which vibrations of the
order of 30 Hz with large amplitudes were observed during lip
synchronisation with speech. After experimentation with damping, it
was found that the best results came from damping each geometric
parameter stream independently. For a parameter stream P:

$$P^m_t = r P_t + (1 - r) P^m_{t-1}$$

$P^m_t$ - the modified value of the parameter P at time t
$P_t$ - the raw value of the parameter P
$r$ - the damping ratio
$P^m_{t-1}$ - the modified value of parameter P at time t-1

The damping ratio used was r=0.75. It is likely that different
methods of damping will be developed for different geometric
parameters and that these may have different values for any damping
ratios r.
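
The damping of a single parameter stream can be sketched as below. The function name is hypothetical, and seeding the filter with the first raw value is an assumption, since the text does not say how the first modified value is initialised.

```python
def damp_stream(raw_values, r=0.75):
    """Damp one geometric parameter stream.

    Each modified value is r times the raw value plus (1 - r) times the
    previous modified value, with r the damping ratio.
    """
    damped = []
    previous = None
    for value in raw_values:
        if previous is None:
            # Assumption: the first modified value equals the first raw value.
            previous = value
        else:
            previous = r * value + (1.0 - r) * previous
        damped.append(previous)
    return damped
```

Each analyser output would be passed through its own call, so that every geometric parameter is damped independently and could later be given its own ratio.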

Identifying the main speaker

In an audio conference call, there is no need to identify the main
speaker. The voice channels 91 can just be mixed and the users 17
will sort out the situation if several people speak at once; if
necessary a chairman will be appointed to determine the next speaker.

In an avatar conference, it is useful to know who the main speaker is
for several reasons:

- To plan camera shots
- To stop lip synchronisation being generated from just background
  noise giving the visual effect of many avatars speaking all the
  time when they are not actually speaking

Microphones 12 often pick up background noise 271, particularly if the 
user 17 is in an open plan office. In an ideal world, all microphones 
would only pick up the voice of the user 270 and automatically filter 
out background noise 271. In many user environmental locations 273, 
background noise 271 can be at the same amplitude as voice 270 or even 
higher. The filter 205 in the LSG 100 plays an important role in 
reducing this background noise 271 before it reaches the LSG 100. 
Where the background noise 271 is high, it is difficult for the LSG 
100 to know whether the audio stream is noise 271 or voice 270, even 
after filtering. In many cases, the LSG 100 generates a stream of 
geometric positions 101 from the digital audio stream 104 that is in 
fact just background noise 271. 

One simple way of eliminating the problem of identifying whether the
audio stream 104 is voice 270 or background noise 271 is to request
users 17 to turn off their microphones 12 when they are not speaking.
This effect can also be implemented in a different way by requesting
that the user 17 presses a 'Push to Speak' button 272 on the avatar
session user interface 10 whilst he speaks. If several users 17 have
their buttons 272 depressed at the same time, then the audio mixer 90
mixes all the active channels.

Users 17 are often multi-tasking with their hands and many users do
not want to press a button each time they wish to speak. The ideal
way of eliminating the problem of identifying whether the audio stream
104 is voice 270 or background noise 271 is to improve the filtering.
Improved filtering will remove the need to press a button or switch
the microphone on or ask the user to work in a quiet room. Filtering
may be improved by using active noise reduction in which a second
microphone situated away from the speaker's mouth can capture
background noise and subtract it from the signal in the main
microphone. As the power of personal computers grows, it will become
possible to train software with the voice pattern of the speaker and
to use that pattern to isolate the voice from the background noise.

It is a further purpose of this embodiment that the LSG 100 uses 
filtering and switching techniques as described above to more reliably 
generate events 81 that indicate whether a user 17 is speaking or not. 

LSG Architecture

The avatar conference is a client-server architecture. The LSG 100
runs on each personal computer client 3. The alternative was to run
the LSG 100 on the session server 1. It is better in most instances
to run the LSG 100 on the personal computer client 3 rather than the
session server 1 because (a) this uses up less network bandwidth in
that the data rate for the combined compressed audio and geometric
values 134 is much less than that for the digital audio stream 104 and
(b) the network architecture is more scalable for large conferences in
that massive session server processing demands are avoided.

The software code size for the LSG is around 20 kBytes. This has the
advantage of being small compared to other approaches which often
involve the necessity for large dictionaries to be on the client
personal computer 3, usually by downloading over the network 2 from
the session server 1. Such a small size of software code makes the
LSG suitable for applications on small network devices such as mobile
phones.

It is a purpose of this first embodiment to disclose a process wherein
sound passes through the avatar user interface system comprising the
following steps:

- a microphone means records sound from a user of a computing
  appliance means as the user speaks;
- a lip synchronisation generator means on the computing appliance
  means processes the sound to provide a combined audio and geometric
  position stream;
- the computing appliance means streams the combined audio and
  geometric position stream over the network to an audio mixer;
- the audio mixer mixes the combined audio and geometric position
  stream with any other combined audio and geometric position streams
  to produce a specific mixed audio and geometric position stream for
  each computing appliance;
- the audio mixer sends each computing appliance its specific mixed
  audio and geometric position stream;
- the computing appliance plays the specific mixed audio and
  geometric position stream to its user via a loudspeaker means.

It is a further purpose of this first embodiment to disclose a lip
synchronisation generator process comprising a process performed at
regular intervals on a digital audio stream flowing into a buffer of
the following steps:

- the contents of the buffer are copied and then the buffer is
  emptied;
- a discrete Fourier transform is performed on the copied contents of
  the buffer and a spectrum is output;
- one or more analysers analyse the output spectrum and each analyser
  outputs a value representing a geometric position of a part of a
  talking head.
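
The three steps of the lip synchronisation generator process can be sketched as below. The class and function names are hypothetical, the analysers are passed in as plain functions, and a naive discrete Fourier transform is used for clarity rather than speed; none of these details are taken from the specification.

```python
import cmath


def dft_amplitudes(samples):
    """Naive discrete Fourier transform; amplitudes for bins 0..N/2."""
    n = len(samples)
    amplitudes = []
    for k in range(n // 2 + 1):
        total = sum(
            samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)
        )
        amplitudes.append(abs(total) / n)
    return amplitudes


class LipSyncStep:
    """One regular-interval step over a buffer of digital audio samples."""

    def __init__(self, analysers):
        self.buffer = []
        self.analysers = analysers  # name -> function(spectrum) -> value

    def feed(self, samples):
        # The digital audio stream flows into the buffer between steps.
        self.buffer.extend(samples)

    def tick(self):
        copied = list(self.buffer)  # step 1: copy the buffer contents...
        self.buffer.clear()         # ...and then empty the buffer
        if not copied:
            return {}
        spectrum = dft_amplitudes(copied)  # step 2: transform to a spectrum
        # Step 3: each analyser outputs one geometric position value.
        return {name: analyse(spectrum) for name, analyse in self.analysers.items()}
```

A caller would invoke `tick()` from a timer at the chosen interval and forward the returned values alongside the audio.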

Camera shot direction 

One role of the software director 80, a software agent, is to decide
and activate the cameras to form a sequence of shots. The camera shot
shown in the meeting room media window depends on:

- the mode chosen 84
- the layout chosen 85
- flow of events (historical and actual) 81
- flow of shots (historical and actual) 82
- flow of actions (historical and actual) 83
- timers 86
- random choice
- the cameras programmed (61-64, 71-77 etc)

The rules for the shots can be very simple for some modes such as Mode
M1 and fairly complex for modes such as M2. The person programming
these rules has a large degree of freedom and is in effect building an
expert system of an expert film director. The rules are improved with
feedback from users during trials.

During the avatar conference it is normal for different people to
speak at different times. Since each person has his own microphone 12
and personal computer 3, it is known which avatar is associated with a
voice stream. Events include a person stopping speaking and another
person starting speaking. The camera shot is usually on the main
speaker; if several people are speaking at once then a wide shot of
all the participants can be shown.
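
The simplest form of this rule can be sketched as a small event-driven selector. The class name, the event names and the shot labels are hypothetical, and falling back to the wide shot when nobody is speaking is an assumption.

```python
class ShotSelector:
    """Track who is speaking and choose between a speaker shot and a wide shot."""

    def __init__(self):
        self.speaking = set()

    def on_event(self, kind, avatar_id):
        # Events arrive per avatar because each voice stream has a known owner.
        if kind == "start_speaking":
            self.speaking.add(avatar_id)
        elif kind == "stop_speaking":
            self.speaking.discard(avatar_id)
        return self.current_shot()

    def current_shot(self):
        if len(self.speaking) == 1:
            (speaker,) = self.speaking
            return ("speaker_shot", speaker)
        # Several speakers at once, or (by assumption) nobody speaking.
        return ("wide_shot", None)
```

A fuller rule set would also consult the mode, layout, timers and shot history listed above.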

Complex shots that are difficult to do in the real world can be
achieved with relative ease in software in a virtual world. Hollywood
films are incorporating more and more shots filmed using a motorised
robot arm to move the camera over long distances in six degrees of
freedom. This gives a 3D effect from the parallax of objects with
verticals and horizontals moving. It has been found that this 3D
effect increases the sense of presence in the viewer and enhances the
enjoyment of the film. As an example of the use of this technique in
an avatar conference, a moving camera can track a person as he enters
the room and sits down. It is a further object of this invention to
maximise the sense of presence for the viewer by using 6 degree of
freedom camera movements.

Acting direction

Another role of the software director 80, a software agent, is to
decide on the ambient and event animations of the avatars. This is
equivalent to the director of a stage play defining every aspect of an
actor's facial and body movement. The animation shown in the meeting
room media window depends on at least some of:

- the mode chosen 84
- the layout chosen 85
- flow of events (historical and actual) 81
- flow of actions (historical and actual) 83
- timers 86
- random choice
- actions 83 available in a library 87
- action generator capabilities 88 as defined by action parameters
  243

Animation actions 

Animation actions can be classified into four types:

- Ambient animations (generated by software director)
- Event animations (generated by software director)
- Head/facial animated gestures (triggered by user)
- Hand/arm/body animated gestures (triggered by user)

Ambient animations 

An actual person is almost never still. Breathing, swaying, changing
gaze, small head movements and many others are termed ambient
animations. In a meeting, ambient animations depend on the role of
the person and his culture. A speaker will usually move his hands and
arms a lot. A listener will be less dynamic. Ambient animations are
designed to be encouraging towards a good meeting atmosphere;
listeners' faces can be seen to smile and look positive; heads can nod
regularly as if in agreement or in understanding; body posture can be
upright rather than slouched. Ambient animations are generated
automatically by the software director.

Event animations 

Event animations are the actions associated with an event. Here are
some examples:

- a person entering the meeting room, walking to his chair, pulling
  the chair out, sitting on it and moving the chair nearer to the
  table
- the detection of emotion from the audio stream; for example, if a
  laugh is detected, the avatar can be animated as laughing
- a participant has been silent for longer than a certain period,
  actions associated with the participant not being involved in the
  meeting are adopted; a method might be a certain slouching in the
  chair that will convey visually to the other participants that this
  person is not involved much
- the participant is not able to see the meeting room media window 50
  because he is viewing another document, then his avatar could be
  seen reading a document
- a participant takes another session (call). His avatar can be seen
  using a mobile phone

Event animations are generated automatically by the software director
in response to an event.

It is a further purpose of this embodiment that the software director 
automatically generates ambient and event animations. 

Gaze

Humans have clear and distinctive patterns of gaze when engaged in
face to face situations. If the software director creates patterns of
gaze between the avatars during the conference that meet the
subconscious expectations of the viewer, then the viewer will
experience a high sense of presence. If the software director creates
patterns of gaze between the avatars that break the subconscious
expectations of the viewer, then the viewer will be distracted and
find the avatar behaviour to be disconcerting. It is a generally
accepted research conclusion that one of the limiting factors on the
uptake of video telephony is that the patterns of gaze are
disconcerting. The software director 80 uses rules for controlling
the gaze of the avatars based on observations of people in meetings.

Gestures 

In a meeting, a participant often wishes to convey information by body
language gestures. The gesture is sometimes purposeful - based on an
active decision by the participant. Examples include:

- raising his hand to show he wants to ask a question
- clapping in applause
- waving to say hello

Body language can also be passive, often without the participant being
aware of the body language he is sending out. Examples include:

- shaking head in disagreement with what is being said
- nodding in agreement with what is being said
- slumped in a chair, bored

In an avatar conference, a participant could select a button in the
user interface corresponding to the body language gesture he wishes to
convey. Other participants looking at the meeting room media window
will see the gesture. Both active and passive gestures could be used.
Gestures can be particularly useful to the chairman of a meeting who
can respond to a gesture in choosing the next person to speak.

It is a further purpose of this embodiment that the software director
generates animated gestures in response to an active user trigger.

Animation architecture 

The software director 80 generates a flow of animations 83 for each
avatar 5. The animations are retrieved from an action library 87 or
are generated in real time from an action generator 88.

Actions 83 in the action library 87 are fixed actions with a fixed
duration and fixed movement. They are usually created by motion
capture or by key frame animation. An example is raising a hand to
wave.

Actions 83 generated by the action generator 88 are variable actions
that are generated in real-time to action parameters 243 specified by
the software director 80. An example is asking the action generator
88 to generate a walking animation action 83 that follows a specified
path across the meeting room floor. A possible set of action
parameters 243 for this example are: the avatar number 8, the path
specification, the walking style, the speed, starting conditions and
end conditions.

If a meeting room is designed with known dimensions, then an action
library 87 of all possible actions 83 can be compiled from motion
capture of an actor or key-frame animation. In this case, an action
generator 88 is not used.



It is a further purpose of this embodiment that any action 83 for an
avatar 5 during the conference can be chosen by the software director
80 either from an action library 87 or an action generator 88.

Animation blending 

Often the 3D position of the avatar at the end of one action 83 is not
directly compatible with the 3D position at the beginning of the next
action 83. The result is a 'jump' from one frame to another in which
hands or feet may travel as much as a metre over 1/25 second, or
whatever the time is between frames. This is very unrealistic and
annoying to the user. Blending of joint positions over several frames
is used to reduce this problem.

In an animation, the movement of an avatar can be defined as a set of
joint positions at each time point or frame in the animation. The
positions of each vertex on the skin or clothes of the avatar are
determined from the joint positions and any weightings associating a
vertex with each neighbouring joint. The main advantage of defining
an animation as a series of sets of joint positions is that it is
smaller than a series of sets of vertex positions. An avatar
typically has 20-50 joints but thousands of vertices. A file with a
set of joint positions stored for every 1/25 second is many times
smaller than a similar file with vertex positions. To blend two
actions 83 it is necessary to adjust a number of frames of animation
in the first action 83 prior to the join and to adjust a number of
frames of animation in the second action 83 after the join such that
the last set of adjusted joint positions in the first action is
geometrically very similar to the first set of adjusted joint
positions in the second action.
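
One way to adjust the frames on either side of the join is sketched below. It is an illustration under stated assumptions rather than the method of the embodiment: each pose is a hypothetical dictionary mapping a joint name to an (x, y, z) position, both sides are moved to the midpoint of the two mismatched poses, and the correction is faded in linearly over `n_frames`.

```python
def add_scaled(pose, offset, weight):
    """Return a copy of pose with weight * offset added to every joint."""
    return {
        joint: tuple(p + weight * o for p, o in zip(pose[joint], offset[joint]))
        for joint in pose
    }


def blend_actions(first, second, n_frames):
    """Adjust the end of `first` and the start of `second` so they meet.

    first, second -- lists of poses; each pose maps joint name -> (x, y, z)
    n_frames      -- number of frames adjusted on each side of the join
    """
    n = min(n_frames, len(first), len(second))
    if n == 0:
        return list(first), list(second)
    end, start = first[-1], second[0]
    # Offsets that carry each side of the join to the midpoint pose.
    off_first = {
        j: tuple((s - e) / 2 for e, s in zip(end[j], start[j])) for j in end
    }
    off_second = {j: tuple(-o for o in off_first[j]) for j in off_first}
    out_first, out_second = list(first), list(second)
    for i in range(n):
        weight = (i + 1) / n          # grows to 1 at the join itself
        a = len(first) - n + i        # frame index in the first action
        b = n - 1 - i                 # frame index in the second action
        out_first[a] = add_scaled(first[a], off_first, weight)
        out_second[b] = add_scaled(second[b], off_second, weight)
    return out_first, out_second
```

Because only an offset is added, the motion within each action is preserved while the two poses at the join become identical.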

Blending works well for joining two similar positions: this is
known as a subtle blend. However, when the positions are radically
different, the result can be completely incorrect. It is quite
possible for arms and legs to pass through each other during a radical
blend; this effect can be most annoying for the user. The software
designer designing an avatar conference system must carefully define
each action 83 in the library of actions 87 such that all possible
actions 83 that the software director 80 selects to follow any given
action 83 require only a subtle blend and not a radical blend. The
main method used to achieve this is the adoption of a limited number
of neutral positions, with each action 83 edited until it starts in
one neutral position and stops in another neutral position.

Animation merging 

More than one action 83 can be merged and played simultaneously to
form a single merged action. Actions are one of two types:

- Dominant action
- Modifying action

A dominant action is an action involving major displacements such as
walking. A modifying action is an action involving minor
displacements such as ambient actions and smiling. Each action 83 in
the library has a defined action type: either a dominant action or a
modifying action. The most common modifying action is facial
animation. It is possible to merge three or more actions, but only
one action in a merged action can be a dominant action. For instance,
the walking dominant action can be combined with smiling and lip
synchronisation. Modifying actions are applied to the dominant action
one frame at a time. The modifying action is defined as a relative
movement of joints. A modifying action is 'added' on top of a
dominant action during animation.
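
The frame-by-frame addition of modifying actions can be sketched as follows. The data layout is an assumption: the dominant action is a list of poses mapping joint names to absolute (x, y, z) positions, and each modifying action is a list of frames mapping joint names to relative displacements.

```python
def merge_actions(dominant, modifiers):
    """Add relative modifying actions on top of one dominant action.

    dominant  -- list of poses, joint name -> absolute (x, y, z)
    modifiers -- list of modifying actions; each is a list of frames,
                 joint name -> relative (dx, dy, dz)
    """
    merged = []
    for t, pose in enumerate(dominant):
        frame = dict(pose)
        for modifier in modifiers:
            if t >= len(modifier):
                continue  # this modifying action has already finished
            for joint, delta in modifier[t].items():
                base = frame.get(joint, (0.0, 0.0, 0.0))
                frame[joint] = tuple(b + d for b, d in zip(base, delta))
        merged.append(frame)
    return merged
```

Only one dominant action is accepted by construction, while any number of modifying actions, such as smiling and lip synchronisation, can be layered on it.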

Animation re-targeting problem

Each avatar 5 of a particular person is a unique size. Some avatars
may be short and fat, others may be tall and thin. When an action 83
is created for the action library 87 it is created on an avatar of a
particular size. If the creation means is motion capture, then the
action 83 will play back best on an avatar with the same size and
shape as the person whose motion is captured. Similarly, if a skilled
animator creates an action 83 for an avatar of a particular size, it
will play back best on an avatar with similar size and shape.




The use of joint positions to define animations makes it possible for
animations created on an avatar of a particular size to be played
back on avatars of different sizes. It is a further purpose of this
embodiment that any action 83 can be played on an avatar 5 of a
different size and shape from the avatar 5 for which the action 83 was
created.



Problems occur when there is interaction between an avatar and
attributes of the virtual environment such as chairs, tables, floor,
door handles and cups. Replaying an action 83 involving contact with
an attribute of the virtual environment on any size avatar may result
in poor motion artefacts. Examples of poor motion artefacts include:
avatar arms passing through tables, not grasping cups properly and
hovering above chairs.

Re-targeting solutions

This problem may be overcome with a commercially reasonable amount of
effort by the simplifications of:

- morphing all avatars 5 to the same standard size and shape
- preparing all actions 83 for avatars of that standard size and
  shape in a defined virtual environment
- crafting the software director 80 state machine to generate series
  of actions that work without exhibiting poor motion artefacts

However, the photo-realism of the avatars will be severely degraded
if the avatars are all the same size and shape.

Different application activities present different re-targeting
solutions:

(i) Camera control: The camera 72 viewpoint, direction, zoom may be
    controlled by the software director 80 such that poor motion
    artefacts are not shown to the viewer.

(ii) Aspect ratio: In a meeting activity, the avatar user interface
    window 260 aspect ratio might be high with the window wide and
    thin such that only the upper bodies of the avatars are
    visible; in this way, accurate animation of the feet and
    posterior is not needed.

(iii) Avatar size range: Avatars are scaled across a design size
    range between a minimum and a maximum. Very small avatars are
    scaled up to a minimum size, very large avatars are scaled down
    to a maximum size and the rest are spread between. In this
    way, taller people have taller avatars than shorter people's
    avatars. The environment is designed to cope with avatars in
    the design size range. The software director 80 state machine
    is crafted to generate series of actions that work without
    exhibiting poor motion artefacts for avatars within the design
    size range.

(iv) Adaptive action control: Actions 83 such as sitting in a chair
    may be animated adaptively to avoid particular motion
    artefacts. An example is sitting in a chair. Avatars of
    different heights and different posterior sizes might either
    float above the chair seat or break through it. Adapting the
    sitting down action 83 to the avatar size by raising or
    lowering the whole avatar during the sitting process solves
    this problem. In this case, the action 83 is probably
    generated by the action generator 88 based on action parameters
    243.

(v) Morphed body parts: For example, to help with the grasping
    problem, all avatars could be given the same size arms and
    hands. In this way, it is only necessary to position the
    avatar's shoulder joint in a fixed position relative to the
    prop 383 for the action 83 to be executed without poor motion
    artefacts.
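
Solution (iii) above can be sketched as a clamp on avatar height. The function names and the range limits of 1.5 m and 1.9 m are hypothetical, and "spread between" is read here as leaving heights inside the design range unchanged; other mappings would also fit the description.

```python
def design_height(real_height_m, min_height_m=1.5, max_height_m=1.9):
    """Clamp an avatar height into the design size range (limits assumed)."""
    return max(min_height_m, min(max_height_m, real_height_m))


def scale_factor(real_height_m, min_height_m=1.5, max_height_m=1.9):
    """Uniform scale to apply to the avatar so it fits the design range."""
    return design_height(real_height_m, min_height_m, max_height_m) / real_height_m
```

Under this reading a taller person still gets a taller avatar, as long as both heights fall inside the design range.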

It is a further purpose of this embodiment that one or more
re-targeting solutions are used to avoid poor motion artefacts.

Speech and text means 

According to this embodiment, Figure 31 is a block diagram of the
session server 1 containing: an audio recording 185, an event
accumulator 89, a speech recognition engine 182, voice profiles of
participants 184, a text transcript 183, a translation engine 186,
translated text 187, a text to speech engine 188, a voice profile 184
of the voice used in the text to speech engine 188, a text chat engine
189 and an e-mail engine 190. The software engines are running in
memory 346. The session server 1 is connected over a network 2 to a
speech recognition service 192, a text to speech service 193 and a

Conference recording and playback

It is quite common for a person who should have been in the conference
to miss it and to wish to know what happened. The conference can be
recorded for later playback by the person who missed the conference.

The conference can be stored as a linear audio file 185 and a
time-stamped event accumulator 89. Events include:

- person enters conference
- new speaker starts
- new agenda item started



On playback, the audio recording 185 can be compressed in length to
reduce the amount of time a person needs to spend listening to the
audio recording. For example, periods of time in which there was no
speech can be removed. Also, the time axis can be compressed such
that playback takes less time than the original conference took. The
playback speed, e.g. 125% of normal speed, can be controlled by the
person listening. The person playing back the conference can also use
the event accumulator 89 as key points at which to start listening to
the recording. For example, if he is only interested in agenda item
number 3 then he can skip to the point at which the chairman has noted
that agenda item number three started.
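
The effect of removing silent periods and speeding up playback on a time-stamped event can be sketched as below. The function name is hypothetical, and the speech segments are assumed to be available as non-overlapping (start, end) pairs in seconds, for example derived from the speaker events.

```python
def compressed_time(t, speech_segments, speed=1.25):
    """Map an original conference time t (seconds) to a playback time.

    speech_segments -- non-overlapping (start, end) pairs in seconds;
                       everything outside them is treated as removed silence
    speed           -- playback speed, 1.25 meaning 125% of normal speed
    """
    kept = 0.0
    for start, end in sorted(speech_segments):
        if t <= start:
            break
        # Count only the part of this speech segment that lies before t.
        kept += min(t, end) - start
    return kept / speed
```

A player could use this mapping to jump to the playback position of an event such as the start of agenda item 3.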

Speech recognition 

As speech recognition engine technology improves, it may become
feasible for a text transcript 183 of the meeting, of a quality
acceptable to users, to be automatically produced by a speech
recognition engine 182 from the audio recording 185 and event
accumulator 89 using voice profiles 184 to improve the speech
recognition. The text transcript 183 can be generated by the speech
recognition engine 182 after the conference or in near real-time
during the conference. A speech recognition service 192 may be used
instead of having a speech recognition engine 182 on the session
server 1.

Text translation

The text transcript 183 can also be translated to translated text 187
in another language using a translation engine 186 present on the
session server 1 or by a network translation service 191 over a
network 2. The translated text 187 can be generated from the text
transcript 183 after the conference or in near real-time during the
conference.

Text to speech; audio translation 

As text to speech engine technology, text translation engine
technology and speech recognition engine technology improve in quality
and speed, it may become feasible for a near real-time audio
translation of the meeting, of a quality acceptable to users, to
be automatically produced by a speech recognition engine 182, a text
translation engine 186 and a text to speech engine 188 from the audio
104. Eventually, each participant can define the language spoken and
the language to be listened to such that a true multi-lingual avatar
conference can take place. A text to speech conversion service 193
may be used instead of having a text to speech engine 188 on the
session server 1.

Text Chat 

During the conference, participants can see text chat in a dedicated
window 26 driven by a text chat engine 189 on the session server 1. A
participant can input and send text messages to all participants or
just to selected participants.

The text chat window 26 can be used to show any or all of: text sent
by a text chat engine 189, events 89 described in a textual format, a
text transcript 183 and translated text 187. The text chat window can
be set to the preferred language of the user such that all text is
translated and displayed in the text chat window 26 in the preferred
language. Text can be shown twice: in the language in which it was
generated and in translation.

e-mail 

Following the conference, the e-mail engine 190 can send copies of
some or all of the text generated during the conference in e-mail form
to the e-mail addresses of participants and also to those who could
not attend.



The e-mail engine 190 can also be used as an e-mail reflector for
participants: e-mails concerning the conference, whether sent
before, during or after the conference, are sent to the e-mail engine,
which then immediately forwards copies to all participants.

Participant roles 

During an avatar conference, users 17 can have identical roles from
the point of view of system functionality or they can be assigned
different roles with different avatar session user interfaces 10.

A user 17 in a Chairman role can be provided with functionality to
enable him to:

- Remove a user from the conference
- Select a speaker to speak next

A user 17 in a Secretary role can type minutes.

A user 17 in a Teacher role can control the display seen on all 
personal computers 3 in a presentation. 

Participant performance 

During an avatar conference, the activity of users 17 can be recorded
and fed back to participants. If a user 17 has not spoken for a
period of time, an event animation is used such that his avatar 5 can
be animated in a way that shows his lack of recent participation. The
avatar might sink down in the chair and appear to withdraw from the
conference. If this visual withdrawal is noticed by other
participants, then they have the opportunity to try to involve the
quiet user in the conference. Alternatively, statistics of the
percentage of the conference time for which each person has spoken
might be shown. This will show up users who might be hogging the
conversation and others who might be lurking without saying anything.
Real-time performance feedback can provide the participants as a team
with a tool for making their conferences more effective.

In applications such as education or training, participant performance
data such as attendance records can be available to teachers. Storage
of and access to information on participants' performance is liable to
be regulated by laws in different countries.

Some of the performance data available includes:

- Number of times a person has spoken at the conference
- Average length of time of each interaction
- Total length of time speaking
- % of conference time speaking
- Attendance at a series of conferences: % of conferences
- Attendance at a series of conferences: % of time
- Number of slides presented
- Number of times whiteboard used
- Number of times chat used
- Number of times whispering used
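
The first four items in this list can be derived from the speaking events alone, as sketched below. The function name, the dictionary keys and the representation of each speaking turn as a (user, start, end) tuple in seconds are assumptions made for illustration.

```python
def speaking_stats(turns, conference_seconds):
    """Summarise speaking activity per user.

    turns              -- iterable of (user, start, end) speaking turns in seconds
    conference_seconds -- total length of the conference in seconds
    """
    stats = {}
    for user, start, end in turns:
        entry = stats.setdefault(user, {"times_spoken": 0, "total_seconds": 0.0})
        entry["times_spoken"] += 1
        entry["total_seconds"] += end - start
    for entry in stats.values():
        entry["average_seconds"] = entry["total_seconds"] / entry["times_spoken"]
        entry["percent_of_conference"] = (
            100.0 * entry["total_seconds"] / conference_seconds
        )
    return stats
```

Users who never spoke do not appear in the result, so a caller wishing to show them would add zero entries from the attendance list.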

Webcam use 

There are occasions when it is useful to see live video 336 of a
participant or an event at a location that a participant wishes the
conference to see. Referring again to Figure 11, the video (or
streaming webcast) 336 can come from a webcam 29 situated on the
display device 264 of a user 17. Alternatively, the video 336 can
come from any other type of video camera 29 connected to a personal
computer 3 on the network 2. The quality of the streaming video 336
seen by each participant will vary with the bandwidth available to the
participant. It can vary from one frame every few seconds for one
participant with a low bandwidth connection to full frame rate for a
participant with a high bandwidth network connection.

The resolution in pixels of the webcam broadcast 336 is usually small
and the software director 80 shows the webcam in a correspondingly
small window. To avoid seeing both a person live and his avatar at
the same time, the avatar of the participant whose webcam 29 is
streaming the broadcast 336 must leave. To maintain the metaphor, the
avatar walks out of the room before the webcast 336 starts and walks
back in when it finishes. The streaming video webcast 336 from the
webcam 29 is shown on the screen 53.



SECOND EMBODIMENT 

Figure 32 is a block diagram of an apparatus for holding an avatar
user interface session in accordance with a second embodiment of the
present invention. In this embodiment, the apparatus comprises a
plurality of personal computers 3 that are connected by a network 2 to
a session server 1, an avatar hosting server 4 containing avatars 5
and a telephone network 155 with telephones 150 and a telephone
server 154.

In this second embodiment, voice is carried over either the telephone
network 155 or the network 2 and data is carried over either the
telephone network 155 or the network 2.

IP/PSTN audio architecture 

Currently, there is a large lag in the existing public internet and
the quality of voice over the internet protocol (VoIP) is much lower
than for the PSTN telephone network or mobile networks such as GSM or
3G.

Two main protocols exist for transmitting over an IP network: HTTP and
UDP. HTTP, which is carried over TCP, checks that each packet is
received. This checking is the main cause of lag. UDP does not check
and typically has much less lag. However, UDP is considered a
security risk by companies, which typically configure their firewalls
to prevent UDP from getting through. A UDP system that does not work
for most companies will not be purchased. In the future, new versions
of IP such as IPv6 may improve the quality and accessibility of VoIP
such that it rivals that of telephone networks.

The main method for remote conferencing today is telephone conference
calls using the PSTN, mobile networks and a conference server for
mixing the calls. Telephone conference calls are expensive, not only
for the calls but also for the service of the session server. Anyone
can access these conferences from wired or wireless handsets.

For an avatar user interface session using VoIP, access is limited to
those who have a microphone and headphone on their computers and who
are situated by a networked computer. Someone who does not have a
networked computer with microphone and headphone cannot participate in
an avatar user interface session using VoIP as disclosed in the first
embodiment.

The use of IP for audio, as disclosed in the first embodiment, avoids
the cost of the telephone calls. To make an avatar user interface
session convenient and available to all those wishing to participate,
it can be an advantage to have combined IP and telephone networks as
disclosed in this second embodiment.

According to this second embodiment, a telephone server 154 is
connected to the IP network 2. Party#1 151 can use his telephone 150
over a telephone line 155 to a telephone server 154 and his personal
computer 3 on network 2. Party#2 152 can use his headset 11 and his
personal computer 3 over network 2. Party#3 153 can use his telephone
150 over a telephone line 155 to the telephone server 154 and not see
the avatar user interface session visually. Party#4 158 can use his
mobile telephone 157 over a mobile telephone network 159 to a mobile
telephone server 156. When mobile handsets advance and 3G mobile
infrastructure is in place, it will be possible for audio and data to
be used simultaneously on a mobile handset. In this way Party#4 can
transfer both voice and data over the mobile network 159. Party#4
could wear a hands-free headset for audio and look at the screen of
his mobile handset to see the avatar user interface session.

The audio mixer 90 can be resident either on the session server 1 or
on a telephone server 154 or 156 on a separate computer.

The Lip Sync Generator (LSG) 100 is normally present on the personal
computer 3 through which it is connected to the sound card 102. When
a telephone connection 155 is used, the LSG functionality 100 can be
present on a server, either the session server 1 or the telephone
server 154. The geometric positions stream 101 and the audio
transform stream 105 can then be routed to the personal computers 3
over the network 2 or to a mobile device over the mobile network 159.

It is a further purpose of this second embodiment that voice or data
transfers in the avatar user interface session can be over a plurality
of networks of any type connected by devices such as network switches
or network routers. Examples of networks include: the internet,
intranets, extranets, Virtual Private Networks (VPNs), GSM mobile
networks, GPRS mobile networks, 3G mobile networks and satellite
networks.

It is a further purpose of this second embodiment that communication
appliances in the avatar user interface session can be any sort of
device including but not limited to: personal computers, mobile
telephones, networked personal digital assistants, networked computer
games consoles, interactive digital televisions and laptop computers.

It is a further purpose of this second embodiment that the system
architecture can be of any type including client server and peer to
peer and that any item of system functionality disclosed in this
embodiment can be resident on any device. Any communication appliance
might also act as a server as well as a client. As an example, it
will be appreciated that the session server 1 does not need to be an
independent unit and that a computing appliance 3 could run both the
functionality of the session server 1 and the avatar user interface
160. It will similarly be appreciated that the software
functionalities and hardware capabilities of many servers could be
combined into a single computing appliance 3. As a further example,
the session server 1, the avatar hosting server 4, the avatar hosting
registry 226 and the avatar agent hosting server 321 could be combined
in one computing appliance 3.


THIRD EMBODIMENT 

In this third embodiment, the format of the avatars in each
communication appliance is appropriate to the computing power,
graphics processing power and display size of the computing appliance
such that real-time visualisation in the avatar user interface system
can be achieved.

Animations of avatars 5 at much less than 12 frames per second look
jerky and reduce the sense of presence felt during a session on an
avatar user interface system.



Different 3D representations 

Avatar computer models can be in different mathematical 3D
representations. Possible representations include but are not limited
to: triangles, quadrangles, other n-sided polygons, B-spline surfaces,
NURBS and subdivision surfaces. It is a further purpose of this third
embodiment that the format of a 3D avatar 39 can be any 3D
mathematical representation.

Progressive 3D representation 

Some representations can be progressive 3D representations, in which
the format actually displayed is an instantiation of arbitrary size
chosen on a continuum from small size to large size. In this way, an
instantiation can be chosen that is optimal for the power of the
computing appliance.
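
By way of example only, the choice of an instantiation on the continuum
might be made from a polygon budget as sketched below. The 12 frames per
second target comes from the text above; the throughput figure supplied
by the caller, the minimum and maximum polygon counts and the function
name choose_polygon_count are assumptions of this sketch.

```python
def choose_polygon_count(polygons_per_second, avatars_in_scene,
                         target_fps=12.0, min_polygons=200,
                         max_polygons=20_000):
    """Pick a per-avatar polygon count that keeps the scene at target_fps."""
    if avatars_in_scene < 1:
        raise ValueError("avatars_in_scene must be at least 1")
    # Polygons that can be drawn per avatar in one frame at the target rate.
    budget = polygons_per_second / (target_fps * avatars_in_scene)
    # Clamp to the smallest and largest instantiations that are available.
    return int(max(min_polygons, min(max_polygons, budget)))
```

An appliance assumed to draw 1,200,000 polygons per second would, for a
scene of ten avatars, be given instantiations of 10,000 polygons each.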

Animatable image representation 

In addition to 3D representations, avatars may be represented in other
ways. One way includes an animated image representation.

According to this third embodiment, Figure 33 is a schematic diagram
of an animatable image 380. There is a minimum of two parts to the
image: an animatable image avatar 382 in the foreground and a
background image 381.

In a basic representation, an animatable image 380 may be described as
a talking post card in which a talking avatar 382 and optionally a
prop image 383 are superimposed in front of a fixed background image
381. The background image 381 is usually photo-realistic. The
animatable image avatar 382 is usually photo-realistic.

According to this third embodiment, Figure 34 is a schematic diagram
of an animatable image avatar 382. For the purposes of animation, the
animatable image avatar 382 is considered to be split into five
animatable avatar segments 395:

(i) upper body segment 390
(ii) jaw and mouth segment 391
(iii) eyes and eyebrows segment 392
(iv) head segment 393
(v) face segment 394

Each animatable avatar segment 395 has a set of one or more different
images representing that segment. The upper body segment 390 normally
has only one image in its set.

Figure 35 is a schematic diagram of a set of four state images 425 for 
the jaw and mouth segment 391 showing the jaw and mouth in four 
states: neutral 470, happy 471, sad 472 and laughing 473. It is 
usual, for a high fidelity animation to be possible, that the jaw and 
mouth segment 391 has several more state images in its set 425. The 
eyes and eyebrows segment 392 has at least two state images 425: eyes 
closed and eyes open. The head segment 393 normally has several state 
images 425 in its set with the head at slightly different 
orientations. The face segment 394 normally has several state images 
425 for different facial expressions in which wrinkles play an 
important role. 

Figure 36 is a tree diagram of the hierarchy of animatable avatar
image components. A complete set of images 424 for playing an
animatable image 380 comprises the background 381, prop 383 and for
each avatar segment 395 the set of state images 425.

The animatable image avatar segments 395 in this embodiment are not
limited to the five disclosed animatable image avatar segments 395.
The animatable image avatar 382 might be split into more or fewer
segments.
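
The hierarchy of image components described above might be held in
memory as sketched below. This is an illustration only: the class names,
the state names and the image file names are hypothetical, and the
numbers of state images per segment are merely one choice consistent
with the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AvatarSegment:
    """An animatable avatar segment 395 with its set of state images 425."""
    name: str
    state_images: Dict[str, str]  # state name -> image file (hypothetical)


@dataclass
class CompleteImageSet:
    """Complete set of images 424 for playing one animatable image 380."""
    background: str
    segments: List[AvatarSegment]
    prop: Optional[str] = None

    def image_count(self) -> int:
        fixed = 1 + (1 if self.prop is not None else 0)
        return fixed + sum(len(s.state_images) for s in self.segments)


def example_set() -> CompleteImageSet:
    """Build a small set with the five disclosed segments."""
    return CompleteImageSet(
        background="background.png",
        prop="prop.png",
        segments=[
            AvatarSegment("upper_body", {"neutral": "body.png"}),
            AvatarSegment("jaw_mouth", {
                "neutral": "mouth_neutral.png", "happy": "mouth_happy.png",
                "sad": "mouth_sad.png", "laughing": "mouth_laughing.png"}),
            AvatarSegment("eyes_eyebrows", {
                "open": "eyes_open.png", "closed": "eyes_closed.png"}),
            AvatarSegment("head", {
                "centre": "head_c.png", "left": "head_l.png",
                "right": "head_r.png"}),
            AvatarSegment("face", {
                "neutral": "face_neutral.png", "smiling": "face_smile.png"}),
        ],
    )
```

A player could iterate over the segments of such a set to load every
state image before playback begins.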

Animatable image generation 

Figure 37 is a schematic diagram of an animatable image generator 397
resident on an avatar hosting server 4. The animatable image
generator 397 is based on an avatar player engine 210. An animatable
image 380 comprising a complete set of images 424 may be generated by
the avatar player engine 210 from a photo-realistic avatar 238 and a
virtual background scene 65 using a virtual camera 61. The
photo-realistic avatar 238 is posed in front of camera 61 to form a
neutral pose defined as an action 83 in which the photo-realistic
avatar 238 looks forward with eyes open, a neutral expression and
mouth closed. This pose when viewed from camera 61 generates a base
animatable image avatar 382. When the photo-realistic avatar 238 is
removed, the image of the scene 65 viewed from camera 61 is the
background image 381. The set of state images 425 for an animatable
avatar segment 395 is generated by applying a predefined set of poses
as actions 83 in the animatable image generator 397.

Figure 38 is a schematic diagram of an apparatus for animatable image
generation 398. If a photo-realistic avatar 238 of a subject person
428 is not available, an animatable image 380 may be generated from a
single photo-realistic image 399 of a person in front of a background.
A skilled person 427 will use image processing software 426 running in
memory 345 on a personal computer 3 to process the image 399 to define
the complete set of images 424.

This invention is not limited to using animatable image generators 397
and 398. For example, a complete set of images 424 could be generated
from video 336 or a set of still images taken of the subject person.
The animatable image generator 397 could be resident on a personal
computer 3.

Animatable image playing

Referring again to Figure 20, the animation of an animatable image 380
is generated and played in a similar way to that of a 3D avatar 39. A
software director 80 generates the actions 83 that are played by the
player 210. The main difference is that the software director 80, the
player 210 and the other components in Figure 20 are designed to work
with animatable images 380 instead of 3D avatars 39 and scenes.

The animatable image avatar 382 is animated in the player 210 from
actions 83 by a combination of methods that are now disclosed. The
animatable image avatar 382 is normally based on a front view of the
avatar covering at least the face, but rarely descending below the
shoulders. This focus on the face removes the need to attempt to
animate upper body movements such as arm gestures, lower body
movements such as walking, or even turning the head by more than a few
degrees.



There are five animation action types 83 generated by the software
director 80 for the animation of an animatable image avatar 382 by a
player 210:

A. Body movement
B. Head movement
C. Lip synchronisation
D. Eye movement
E. Facial expression

The body movement action A is limited to a combination of horizontal
translation, vertical translation and rotation relative to the
background image 381. A body movement action A affects all five
animatable avatar segments 390-394. The five animatable segments
390-394 are moved according to a body movement action A as if they
were locked together.

The head movement action B is limited to two rotational components
about the middle of the neck: a first rotation component left-right,
equivalent to shaking one's head, and a second rotation component
up-down, equivalent to nodding one's head. A head movement action B
affects the four animatable avatar segments 391-394. The four
animatable segments 391-394 are moved according to a head movement
action B as if they were locked together.

The two actions A and B are added together to give a combined head and
body movement.

The lip synchronisation movement action C affects only the jaw and
mouth segment 391. The eye movement action D affects only the eyes
and eyebrows segment 392. The facial expression action E affects only
the face segment 394.

The three actions C, D and E are applied locally to their respective
segments after the actions A and B have been applied.
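
The order in which the five action types are applied might be sketched
as follows. As a simplifying assumption of this sketch, each of the
movements A and B is reduced to an image-plane triple (horizontal
offset, vertical offset, rotation) and each local action C, D or E
simply selects a state image by name; the segment names and the function
apply_actions are hypothetical.

```python
SEGMENTS = ("upper_body", "jaw_mouth", "eyes_eyebrows", "head", "face")
# Segments 391-394, which follow the head movement action B.
HEAD_GROUP = ("jaw_mouth", "eyes_eyebrows", "head", "face")
# Each local action affects exactly one segment.
LOCAL_TARGET = {"C": "jaw_mouth", "D": "eyes_eyebrows", "E": "face"}


def apply_actions(body, head, local_states):
    """Combine actions A and B, then apply C, D and E locally.

    body and head are (dx, dy, rotation) triples; local_states maps an
    action letter "C", "D" or "E" to the name of the state image to show.
    """
    frame = {}
    for segment in SEGMENTS:
        dx, dy, rotation = body  # action A moves all five segments together
        if segment in HEAD_GROUP:
            # Action B is added to A for the four head segments.
            dx, dy, rotation = dx + head[0], dy + head[1], rotation + head[2]
        frame[segment] = {"transform": (dx, dy, rotation), "state": "neutral"}
    # Local actions are applied after the combined head and body movement.
    for action, state in local_states.items():
        frame[LOCAL_TARGET[action]]["state"] = state
    return frame
```

In this sketch the upper body segment receives only the body movement,
whereas the jaw and mouth segment receives the sum of the body and head
movements and then its own lip synchronisation state.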

Two forms of morphing are used: at segment boundaries and between
images in a set. At segment boundaries, such as the neck which lies
between the body segment 390 and the head segment 393, image morphing
is used to stretch the image on one or both sides of the boundary.
Between-image morphing is used where there is a gradual progression
from one image in the set to another for a particular segment.
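
As a deliberately simplified stand-in for the between-image morphing
described above, a linear cross-fade between two state images can be
sketched as follows. A cross-fade blends pixel values only and does not
warp geometry, so it is not a full morph; the function name cross_fade
and the representation of an image as a list of rows of grey levels are
assumptions of this sketch.

```python
def cross_fade(image_a, image_b, t):
    """Blend two equally sized grey-level images; t=0 gives a, t=1 gives b."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie between 0 and 1")
    return [
        [(1 - t) * a + t * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]
```

Stepping t from 0 to 1 over several frames gives a gradual progression
from one state image of a segment to the next.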

Animation of image avatars 382 is not limited to the apparatus and
methods disclosed above, but may be extended to any image based
method.

This embodiment is not limited to one animatable image avatar 382
superimposed in front of the background image 381 but may contain two
or more animatable image avatars 382. Referring again to Figures 16b,
16c and 16d, it can be seen that several animatable image avatars 382
may be generated from several avatars 5.

Referring again to Figures 17a, 17b and 17c, three layouts Layout 1,
Layout 2 and Layout 3 may be used for displaying multiple animatable
image avatars 382 with one or more background images 381 on a single
display device 264. This invention is not limited to displaying
Layouts 1-3 but may cover any layout that fits the application for
which this avatar user interface system invention is used.

Props 215 may be converted into prop images 383 which are animated in
front of the background image 381. The animated prop image 383 may
appear as part of a background image 381; an example is a tree bending
in the wind. Or the animated prop image 383 may appear separate from
the background image 381; an example is a bird flying across the
background image.

It is a further purpose of this third embodiment that the format of an
avatar 5 can be any animatable non-3D mathematical representation
including animated image representations.


Computing appliance variety 

A computing appliance may be very powerful with a processor running at
speeds in excess of 2 GHz, more than 512 MB of memory 345, a display
device 264 with more than 1 million pixels and a specialist 3D
graphics chip such as an Nvidia GeForce 3 from Nvidia Inc (USA). Such
a computing appliance can easily render real-time animation at 20
frames per second of 10-20 avatars 5 in the full generic format as
disclosed in the first embodiment.

However, many computing appliances are less powerful and do not have
specialist 3D processing hardware. Processing power is usually
constrained so as not to use up battery life on lightweight portable
devices with small batteries. Examples include mobile phones and
wireless personal digital assistant appliances. Less powerful
computing appliances usually have less memory than more powerful
computing appliances. Less powerful computing appliances such as
mobile phones may have very small display device 264 sizes with fewer
than 5,000 pixels.

3D avatars with lower levels of detail may be used on intermediate
power computing appliances to achieve the desired animation
performance. Avatars with lower levels of detail typically have fewer
polygons and smaller texture maps. This achieves higher frame rates
and uses less memory, but the downside is that the visual quality of
the 3D avatar is lower.

For high power computing appliances with a lot of avatars in the 
scene, a combination of low and high levels of detail avatars may be 
used to achieve a good frame rate. The closest avatars to the camera 
might be high level of detail and those furthest away might be low 
level of detail avatars. This can be achieved by having two or more 
level of detail avatars available and switching between them. 
Alternatively a progressive avatar approach might be used. 
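
Switching between two levels of detail according to distance from the
camera might be sketched as follows. The threshold of 5 scene units and
the function name pick_level_of_detail are assumptions of this sketch;
positions are taken to be 3D coordinates in the scene.

```python
import math


def pick_level_of_detail(avatar_position, camera_position, near_distance=5.0):
    """Return "high" for avatars close to the camera and "low" otherwise."""
    distance = math.dist(avatar_position, camera_position)
    return "high" if distance <= near_distance else "low"
```

The player would evaluate such a function for each avatar whenever the
camera moves and swap in the corresponding model.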

Animated image representations may be used on low power computing 
appliances to achieve the desired animation performance. These use 
less computing power and memory than 3D representations. 

Multiple avatar formats and converters 

According to this third embodiment, Figure 39 is a block diagram of an
apparatus for holding an avatar user interface session in accordance
with a third embodiment of the present invention. In this third
embodiment, the apparatus comprises computing appliance 160 with a
specific avatar 5 in format A1, computing appliance 161 with a
specific avatar 5 in format A2, computing appliance 167 with a
specific avatar 5 in format A3 and an avatar converter software 164 of
type C3 stored in memory 345. The computing appliances 162, 163 and
167 are connected by a network 2 to an avatar hosting server 4
containing a substantial number of avatars 5, database 6, avatar
converter software 164 of types C1, C2 stored in memory 344 and
specific avatars 5 of formats A1 and A2.

It is a purpose of this third embodiment that the avatar hosting
server 4 has avatar converter software 164 such as C1 that can convert
an avatar 5 into a specific avatar 5 with a format such as A1 at a
different level of detail. The specific avatar 5 in format A1 is then
transmitted over the network 2 to a computing appliance 160 for which
the specific avatar 5 of format A1 is suitable.



An alternative approach is to have avatar converter software 164 C3 in
memory 345 on a computing appliance 167 such that an avatar 5 can be
converted to a specific avatar 5 format A3 locally on the computing
appliance 167. Software techniques such as progressive meshes or
variable levels of detail employed in the avatar converter software
164 C3 known to those skilled in the art might convert the avatar 5 to
several different formats during the conference depending on the
graphics load on the computing appliance.



It is a further purpose of this third embodiment that a computing
appliance 167 can contain avatar converter software 164 C3 for which
the specific avatar 5 of format A3 is suitable at any one instant.
This invention is not limited to one type of avatar converter software
164 running on a computing appliance 167 but may allow any number of
types of avatar converter software 164 on a computing appliance 167.

Different avatars 5 will contain different visual data depending on
how they were originally generated; for instance, photo-realistic
avatars 238, parameter avatars 232 and animatable image avatars 382
will be based on different raw data.

The avatar hosting server 4 usually stores the raw data from which the
avatar 5 was generated, including manual input used to generate the
avatar 5. The raw data for photo-realistic avatars is usually in the
form of digital images 19. In this way, an avatar 5 can be
regenerated automatically from the images 19. Moreover, if any
technological improvements are made to the avatar 5, then a newer
version of the avatar 5 can be generated automatically from the images
19 and replace the older version.


The suite of avatar converter software 164 should ideally be capable
of converting any avatar 5 to any requested format. These conversions
will not always be of the highest quality due to missing information.
For instance, an animatable image avatar 382 cannot easily be
converted into a photo-realistic avatar 238 because there is no
information on the body shape.

The avatar hosting server 4 usually stores all formats of the avatar 5
that have been previously requested. The reason is to minimise
response time for requests for that avatar in a particular format. If
an avatar must be first converted into a particular format then it
will take the avatar hosting server 4 longer to service a request. It
is a benefit to the user that his request for an avatar is serviced as
quickly as possible. However, storing several formats of each avatar
uses up a lot of server space. To conserve server storage space, it
may be pragmatic data management to delete formats that have not been
used for a considerable time and formats that have been superseded by
new versions.
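
The storage policy described above, in which previously requested
formats are kept and long-unused formats are deleted, might be sketched
as follows. The class AvatarFormatCache, its method names and the use of
a caller-supplied clock value are assumptions of this sketch; the
convert argument stands for whichever avatar converter software 164 is
available.

```python
class AvatarFormatCache:
    """Keep converted avatar formats and drop those unused for too long."""

    def __init__(self, convert, max_idle_seconds):
        self._convert = convert          # callable(avatar_id, fmt) -> avatar
        self._max_idle = max_idle_seconds
        self._store = {}                 # (avatar_id, fmt) -> converted avatar
        self._last_used = {}             # (avatar_id, fmt) -> time of last use

    def get(self, avatar_id, fmt, now):
        """Return the avatar in the requested format, converting if needed."""
        key = (avatar_id, fmt)
        if key not in self._store:
            self._store[key] = self._convert(avatar_id, fmt)
        self._last_used[key] = now
        return self._store[key]

    def evict_idle(self, now):
        """Delete formats not requested within max_idle_seconds."""
        stale = [key for key, used in self._last_used.items()
                 if now - used > self._max_idle]
        for key in stale:
            del self._store[key]
            del self._last_used[key]
        return stale
```

A repeated request for the same format is then served from storage
without a second conversion, while a format evicted as idle is converted
afresh the next time it is requested.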



It is a further purpose of this third embodiment that the
communication session on the avatar user interface system invention
may involve any combination of 3D or animatable image representations
on the computing appliances. At one extreme, all the computing
appliances may be high power personal computers 3 and use
photo-realistic avatar 238 representations. At the other extreme, all
the computing appliances may be mobile phones and use animatable image
380 representations. In a typical session, one computing appliance
might use 3D avatars 39 with high numbers of polygons, a second
computing appliance might use NURBS based 3D avatars 39 and a third
computing appliance might use animatable images 380.



Major conference 

In a major conference, with around 50-1,000 participants, a number of
techniques are used to run the avatar user interface session on any
computing appliance 167. It is likely to be impossible for some time
that a personal computer 3 would have enough power to fully animate
and render 1,000 avatars 5 at the same time. In large conferences, it
is still useful for the complete audience to be seen to provide
participants with an ambience matching the scale of the event. There
are many ways for the software director 80 to achieve this ambience:

- pan quickly across a single image taken at a real conference of the
  appropriate size; the real conference room must match the avatar
  user interface session room
- store short video clips in the personal computer 3 taken at a real
  conference of the appropriate size and replay them from time to
  time
- when there is a question from the audience, zoom in quickly from
  the real audience image to a virtual close up of the avatar
  surrounded by other avatars

The chairman of a large avatar user interface session needs to handle
questions from a lot of people.

Figure 40 is a schematic layout of a major conference user interface
functionality 291 for the Chairman consisting of a list of attendees
with names 244 and organisations 293 wishing to ask questions, and a
button 294 for the Chairman to permit an attendee to speak.
Attendees have buttons 290 to indicate a desire to ask a question and
buttons 295 for testing their microphones before asking a question.

When a user attendee 17 has pushed the ask button 290, the software
director 80 will raise the hand of the avatar 5 of that attendee.
When the Chairman presses the button 294 to give the attendee 17
permission to speak, the software director 80 will lower the hand of
the avatar 5 and connect the user 17's audio input channel 91 to the
audio mixer 90.
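
The handling of the ask button 290 and the permit button 294 might be
sketched as a small state holder. The class QuestionQueue and its method
names are hypothetical; a raised hand is represented simply by
membership of the waiting list, and connection to the audio mixer 90 by
membership of the connected set.

```python
class QuestionQueue:
    """Track attendees waiting to ask questions and those allowed to speak."""

    def __init__(self):
        self.waiting = []       # attendees with raised hands, in order asked
        self.connected = set()  # attendees whose audio reaches the mixer

    def press_ask(self, attendee):
        """Ask button: raise the avatar's hand and join the Chairman's list."""
        if attendee not in self.waiting and attendee not in self.connected:
            self.waiting.append(attendee)

    def hand_raised(self, attendee):
        return attendee in self.waiting

    def permit(self, attendee):
        """Permit button: lower the hand and connect the audio channel."""
        if attendee not in self.waiting:
            raise ValueError("attendee has not asked to speak")
        self.waiting.remove(attendee)
        self.connected.add(attendee)
```

The Chairman's list in such a sketch is simply the waiting list in the
order in which the attendees pressed their ask buttons.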

On pushing the Test Microphone button 295, a user 17's microphone 12
will be connected over the network 2 to the audio mixer 90. A short
dialogue will take place using pre-recorded sound files on the audio
mixer in which the user 17 is able to verify that his microphone 12
works and is connected to the audio mixer 90. This test procedure
should reduce the frequency of occurrence of a user 17 trying to speak
but not being heard by the conference attendees because of a
microphone problem.

It is often the case in large conferences that there are breaks
between presentations during which people chat. The whispering
capability of this invention will permit a large number of whispered
conversations of 2 or more people during these breaks.

It is a purpose of this invention that large conferences of many
thousands of people can be successfully held.

It is a purpose of this third embodiment to disclose a process wherein
a remote presenting user presents a presentation remotely comprising
the following steps:

- the remote presenting user starts a prepared presentation;
- remote audience users watch the avatar of the remote presenting
  user perform the prepared presentation;
- present audience users present physically together in a theatre
  watch a projection of the avatar of the remote presenting user
  perform the prepared presentation;
- the prepared presentation ends;
- a remote audience user asks a question;
- the remote presenting user views the avatar of the remote audience
  user asking the question from amongst a single virtual audience and
  the avatar of the remote audience user gazes at the remote
  presenting user;
- the present audience users view the avatar of the remote audience
  user asking the question from amongst a single virtual audience
  around the avatar of the remote presenting user and the avatar of
  the remote audience user gazes at the avatar of the remote
  presenting user.

FOURTH EMBODIMENT 

In this fourth embodiment, rather than each participant being in a
separate location, the apparatus of the invention supports two or more
participants at one location.

Speaker phone 

It is common in audio conferences for several people to congregate
around a speaker phone in a single room for a conference call which
includes at least one other location. Often the speaker phone has
several microphones attached to it that are placed near different
people around the meeting table. In this way, the people in the room
can communicate directly via physical document exchange, body
language, whispering and facial expressions in parallel to the formal
audio exchanges.

Shared display device 

Figure 41 is a schematic layout of an apparatus for holding an avatar
user interface session in accordance with a fourth embodiment of the
present invention. A personal computer 3 with a computer cabinet 16
contains a wireless transmitter/receiver 170. Participants 17
'Albert', 'Bruce' and 'Charles' sit around a table 172 at an
environmental location 273 with each participant 17 wearing a wireless
headset 171 including microphone 12 and earphone 13. Each wireless
headset 171 has an identified owner, e.g. Albert. A large display
device 264 shows the avatars 5 of all participants on the avatar user
interface session other than those participants 17 around the table
172 at this location. Means for controlling the computer such as a
keyboard 14 or a mouse 15 are available for use by the participants
17. The environmental location 273 is usually a room such as a
meeting room.

In this way a participant 17, e.g. Albert, can see all other
participants either physically 17, i.e. Bruce and Charles, or as
avatars 5. When a participant speaks, since the wireless headset 171
is identified as being owned by a specified person, e.g. Albert, it is
possible for the lipsync to be applied to the correct avatar 5 of
Albert. The wireless headset may be identified by means of an
identification chip inside the headset.

Sound mixing

In a further refinement of this embodiment, one or more loudspeakers
173 are used for broadcasting sound to the participants 17 and each
user has a wireless microphone 12 linked to the identity of the user.
Signals from the wireless microphone 12 are transmitted to the
receiver 170. To prevent audio feedback between the loudspeakers 173
and the microphones 12, the audio mixer 90 does not mix into the
output for that location the audio streams from the microphones 12 of
any of the participants at that location.
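
The exclusion of local microphones from the mix sent to a shared
location might be sketched as follows. Audio streams are represented as
plain lists of sample values and mixing as sample-wise addition; these
simplifications, and the function name mix_for_location, are assumptions
of this sketch rather than a description of the audio mixer 90.

```python
def mix_for_location(streams, location_of, target_location):
    """Mix every stream except those originating at target_location.

    streams maps a speaker name to a list of samples; location_of maps a
    speaker name to the location at which that speaker is sitting.
    """
    kept = [samples for speaker, samples in streams.items()
            if location_of[speaker] != target_location]
    if not kept:
        return []
    length = min(len(samples) for samples in kept)
    # Sample-wise sum of the remaining streams.
    return [sum(samples[i] for samples in kept) for i in range(length)]
```

The loudspeakers in a room then reproduce only the voices of
participants elsewhere, so that a microphone in the room does not hear
its own signal returned from the mixer.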



In the case where there are two or more loudspeakers 173, 3D sound can
be used to increase the sense of co-presence of the participants. If
an avatar 5 on the far left of the display device 264 is talking, then
the 3D sound can be mixed locally to appear as if it is coming from
the mouth of that avatar. In this case, the sound volume from a
loudspeaker on the left would be louder than that from a loudspeaker
on the right. The audio mixer 90 is not involved in generating the 3D
sound.

Figure 41a is a schematic of the 3D sound processing. The mixer 90
generates an audio output stream 93 which travels over the network 2
to the PC 3. A splitter 141 splits off the geometric positions 101 to
the player 210. The splitter 141 sends the remaining audio transform
105 to the decompressor 140. The decompressor 140 generates digital
voice 104 and streams it to the 3D sound generator 143. The player
210 calculates the pixel coordinates 142 on the display 264 of the
mouth of the avatar 5 that is speaking and streams them to the 3D
sound generator 143. The 3D sound generator 143 uses the known
positions of the loudspeakers 173 relative to the display 264 to
generate digital voice signals 104 to the sound card 102, which
streams analogue voice 103 to the loudspeakers 173.
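
One simple way in which a 3D sound generator might turn the horizontal
pixel coordinate of the speaking mouth into loudspeaker volumes is
constant-power panning, sketched below. The sketch assumes exactly two
loudspeakers placed at the left and right edges of the display; that
placement, the panning law and the function name stereo_gains are
assumptions and are not taken from this disclosure.

```python
import math


def stereo_gains(mouth_x_pixels, display_width_pixels):
    """Return (left_gain, right_gain) for a mouth at the given pixel column."""
    if display_width_pixels <= 0:
        raise ValueError("display_width_pixels must be positive")
    # 0.0 is the left edge of the display, 1.0 is the right edge.
    pan = min(1.0, max(0.0, mouth_x_pixels / display_width_pixels))
    angle = pan * math.pi / 2
    # Constant-power law: the squares of the two gains always sum to one.
    return math.cos(angle), math.sin(angle)
```

For an avatar whose mouth is near the left of the display, the left
gain is larger than the right gain, matching the behaviour described
above.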

This fourth embodiment has the advantage of allowing an avatar user
interface session to take place with more than one person at a single
location. It is also scalable for the case where there are two or
more locations, with more than one participant at each location.
Furthermore, it has the advantage of greatly increasing the sense of
presence by showing all the non-present people on the call as avatars.

FIFTH EMBODIMENT

It is a purpose of this fifth embodiment that the avatar user
interface system comprises an integrated multi-media communication
system based around photo-realistic avatars for communication with
people and intelligent agents in both synchronous and asynchronous
ways that is supportive of multi-tasking.

Multi-tasking

Different communication activities have different efficiencies. The
following table shows rough estimates of average speed of different
communication activities:



Communication 


Speed 


Average time per 


Relative 


Relative 


Activity 


(words/mi n) 


person per 100 


(%) 


(ratio) 






words (mins) 






Talking/Listening 


190 


0.52 


57% 


4 


Reading 


333 


0.30 


100% 


7 


Typing (keyboard) 


50 


2.00 


15% 


1 


Typing (SMS) 


10 


10.00 


3% 


0.2 
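
By way of illustration only, the derived columns of the table can be recomputed from the speed column. The sketch below assumes, from the values shown, that the percentage is taken relative to reading and the ratio relative to keyboard typing; the function name `derived_columns` is hypothetical. The figures in the table are rough estimates and appear to be rounded or truncated, so small differences from the computed values are to be expected.

```python
from typing import Dict

# Speeds in words per minute, as given in the table.
SPEEDS_WPM: Dict[str, float] = {
    "Talking/Listening": 190,
    "Reading": 333,
    "Typing (keyboard)": 50,
    "Typing (SMS)": 10,
}


def derived_columns(
    speeds: Dict[str, float],
    percent_base: str = "Reading",          # assumed baseline for the (%) column
    ratio_base: str = "Typing (keyboard)",  # assumed baseline for the (ratio) column
) -> Dict[str, Dict[str, float]]:
    """Compute minutes per 100 words and the two relative columns."""
    rows = {}
    for activity, wpm in speeds.items():
        rows[activity] = {
            "mins_per_100_words": 100 / wpm,
            "relative_percent": 100 * wpm / speeds[percent_base],
            "relative_ratio": wpm / speeds[ratio_base],
        }
    return rows


if __name__ == "__main__":
    for activity, row in derived_columns(SPEEDS_WPM).items():
        print(activity, {name: round(value, 2) for name, value in row.items()})
```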



Social trends in the workplace include rising productivity and rising
salaries. This points to an increasing need for employees to be more
productive by multi-tasking: carrying out more than one task at a
time. All other things being equal, a user of a communication system
is likely to prefer a communication system that is designed to support
multi-tasking.

Some tasks can be carried out at the same time whilst others cannot.
The following table shows a set of 'rules of thumb' for which
communication activities can be multi-tasked. Each mode shows a pair
of tasks that can be performed together. It is assumed that three
tasks cannot be performed simultaneously by most people.

Mode   Talk   Listen   Read   Visual     Type/   Manual      Sign / face /
                              Media      write   eg making   body
                              (Silent)           tea         language

1      X                                         X
2      X                                                     X
3             X        X
4             X               X
5             X                          X
6             X                                  X
7             X                                              X



Modes 3 and 5 in the table above are perhaps the most common modes of
multi-tasking on a personal computer between different task types. An
integrated avatar user interface system should support reading and
typing tasks whilst listening.
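
By way of illustration only, the seven modes of the table can be encoded as pairs of tasks and queried to decide whether a proposed combination can be multi-tasked. The task labels and the function name `can_multitask` are hypothetical; the pairs themselves are those marked in the table, and the rule that three tasks cannot be combined follows the assumption stated above.

```python
from typing import Dict, FrozenSet, Iterable

# One entry per mode in the table; each mode is a pair of compatible tasks.
MODES: Dict[int, FrozenSet[str]] = {
    1: frozenset({"talk", "manual"}),
    2: frozenset({"talk", "sign"}),
    3: frozenset({"listen", "read"}),
    4: frozenset({"listen", "visual_media"}),
    5: frozenset({"listen", "type"}),
    6: frozenset({"listen", "manual"}),
    7: frozenset({"listen", "sign"}),
}


def can_multitask(tasks: Iterable[str]) -> bool:
    """Return True if the given tasks can be carried out at the same time."""
    requested = frozenset(tasks)
    if len(requested) <= 1:
        return True  # a single task is always possible
    # Two tasks must match a listed mode; three or more never match a pair.
    return requested in MODES.values()
```

For example, `can_multitask({"listen", "type"})` corresponds to mode 5, whereas talking whilst reading is not a listed mode and is rejected.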

Another type of multi-tasking is time efficiency of verbal
communication. Whilst in an avatar user interface session, the
participant should be able to carry out other voice tasks in the
periods when the conversation of the session is not important to him.
The following voice tasks are possible whilst in an avatar user
interface session:

- listen to voice-mail
- speak voice-mail
- make a voice call
- receive a voice call
- interact with conversational intelligent avatar agents
- interact with user interfaces

Voice task functions 

Some functional considerations are important for voice tasks:

- mixing of conference audio with the incoming voice signal so that
  participants can be passively aware of what is happening in the
  conference. An example is someone asking you a question whilst you
  are on another task, in which case you are likely to hear "What do
  you think YourName?" and react appropriately
- easy to use switchboard between voice tasks. Multi-tasking
  requires speed and efficiency in switching between synchronous
  voice tasks such as putting a party on hold
- rapid directory look-up of a person, indication of whether a person
  is logged on / active on his personal computer and automatic
  dialling of a voice call
- visual status of voice tasks that are active or on hold
- mixing of voice functionality and text functionality
- ability to switch off direct voice access to yourself giving you
  time to think or just have a break; voice calls can be diverted to
  voice mail for listening to them later

Always-on

Nowadays, most business internet connections are always-on rather than 
dial-up and this trend is likely to continue. Immediate Messaging 
(IM) is often present on the desktop in businesses all the time; with 
IM people can respond to incoming messages immediately. People spend 
ever more of each day at their personal computers. Many employees 
listen to music through headphones while they work. 






WO 03/058518 



PCT/GB03/00031 



There are times when people do not wish to be interrupted by IM or by 
unplanned voice calls. In these times, IM can be switched to e-mail 
and voice calls to voice-mail.

Communication types 

Communications may use any type of media or any combination of
multiple media. The term multi-media is used to cover all types of
media such as but not limited to text, voice, video, image, animation
and avatar.



The following communication types are usually available in an Avatar
User Interface System by way of example but an Avatar User Interface
System is not limited to these communication types. Synchronous
communication is when a communication is usually received in real-time
and often responded to in real-time. Asynchronous communication is
when a communication is usually received after a delay.



Synchronous

- avatar user interface session / avatar conference
- voice call / avatar call
- Immediate Messaging
- video
- Whispering

Asynchronous

- voice-mail / avatar voice-mail / video-mail
- e-mail



An avatar call is when the avatar 5 of the user 17 appears in the
Meeting Room Media window 50 whilst the synchronous voice call takes
place. An avatar voice-mail is when the avatar 5 of the user 17
appears in the Meeting Room Media window 50 whilst the asynchronous
avatar voice-mail is being played back.

Avatar User Interface

User interfaces have evolved over the course of computer history.
They have progressed from a bank of switches and lights through to
sophisticated windows interfaces for information interaction and
communication tasks. With the very recent advent of photo-realistic
avatars and the maturing of voice processing technology, a new form of
user interface is possible.

It is a purpose of this avatar user interface system invention to
disclose a new form of user interface for interacting with people,
information, entertainment and avatar agents.

Figure 42 is a representation of an example of the displayed avatar
user interface 260 in this fifth embodiment. The new switchboard
avatar user interface functionality 268 is added to the avatar session
user interface 10 shown in Figure 12 such that both sets of functions
are integrated and easily accessible through the same user interface
hardware 3.

The switchboard avatar user interface functionality includes:

- a buddy list 240 with data for each buddy such as name 244 and
  facial icon 243
- buddy list buttons such as add buddy 247, edit buddy 248 and delete
  buddy 249
- a switchboard 241 with numbered events 252 including live sessions
  with data for each session party such as name 244 and facial icon
  243, live conferences with conference name 253 and with data for
  each conference attendee such as name 244 and facial icon 243,
  streaming media channels 254 such as music, radio and TV and voice
  mails 255
- a status bar 250 with a message 251 and session control buttons
  such as start session to new party 242, end session 246 and whisper
  to a party on a session 245

Multiple session servers 

In a working day, millions of meetings and calls take place. It would
be technically challenging for all avatar user interface sessions 252
to be served by a single session server 1. Additionally, it is likely
that several companies will compete in the avatar user interface
system marketplace, each company having one or more session servers 1.

Figure 43 is a block diagram of a multi-session server system. It
shows a personal computer 3 with an avatar user interface 260
connected via a network 2 to two or more session servers 1. Different
live sessions 252 may take place on different session servers 1.
Protocol converters 301 are resident at different places on the
system.

Large economic benefits provide strong commercial forces for players
in a market to agree standards. It is likely that a standard avatar
interface protocol 300 between session servers 1 and avatar user
interfaces 260 will be agreed for avatar user interface systems 261.
Eventually this could form a global standard.

It is a purpose of this avatar user interface system invention that an
avatar user interface 260 on a personal computer 3 can simultaneously
be connected to multiple sessions on a plurality of session servers 1
using a standard avatar interface protocol 300.

Multiple protocols 

If two or more standard avatar interface protocols 300 are used,
protocol converters 301 can convert between the protocols in real
time. The protocol converters 301 can be situated on the network 2 or
within the session servers 1 or within the avatar user interface 260
or any other suitable place.

It is a further purpose of this invention that where two or more
standard avatar interface protocols 300 are used, protocol
converters 301 can convert between the protocols in real time.
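
By way of illustration only, a protocol converter can be sketched as a registry of conversion functions keyed by source and target protocol. The disclosure does not define any message format, so the protocol names "A" and "B", the field names and the class `ProtocolConverterRegistry` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Message:
    protocol: str  # name of the avatar interface protocol (hypothetical)
    payload: dict  # protocol-specific fields (hypothetical)


Converter = Callable[[dict], dict]


class ProtocolConverterRegistry:
    """Holds one conversion function per (source, target) protocol pair."""

    def __init__(self) -> None:
        self._converters: Dict[Tuple[str, str], Converter] = {}

    def register(self, source: str, target: str, fn: Converter) -> None:
        self._converters[(source, target)] = fn

    def convert(self, message: Message, target: str) -> Message:
        if message.protocol == target:
            return message  # nothing to convert
        try:
            fn = self._converters[(message.protocol, target)]
        except KeyError:
            raise LookupError(
                f"no converter from {message.protocol} to {target}"
            ) from None
        return Message(target, fn(message.payload))


def a_to_b(payload: dict) -> dict:
    """Example mapping between two invented field layouts."""
    return {"avatar_id": payload["speaker"], "utterance": payload["text"]}
```

Because the registry is an ordinary object, the same sketch applies whether the converter is placed on the network, within a session server or within the avatar user interface.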

SIXTH EMBODIMENT 

It is a purpose of this sixth embodiment that the invention is not
limited to the displayed avatar user interface 260 running in a
browser window 21, but that it can run as a stand-alone application.

Figure 44 is a block diagram of the displayed avatar user interface
260 on the display device 264 being driven by the avatar user
interface software application 262 running stand-alone in memory 345
on the personal computer 3.

SEVENTH EMBODIMENT 

It is a purpose of this seventh embodiment that the displayed avatar 
user interface 260 might be for a digital exhibition in which there 
are many virtual stands representing different organisations on which 
information about their products and services is accessible. 

Currently, there is no effective means of visiting an exhibition 
virtually. There is also no virtual means for simultaneously talking 
to a salesman, viewing his photo-realistic avatar and seeing 
information on the company's products and services. 

Figure 45 is a representation of an example of the displayed avatar 
user interface 260 containing the switchboard avatar user interface 
functionality 268, the avatar session user interface 10 and the 
exhibition user interface functionality 280. The exhibition user 
interface functionality 280 includes an exhibitor list 281 with 
different organisations 282 that can be selected. Pressing a browse 
button 283 enables the user 17 to enter a 3D meeting room media window
50 of the selected organisation in which information media about the 
organisation's products and services is available for browsing by the 
user 17. Pressing a contact button 284 enables the user 17 to call a 
representative of that organisation into the organisation's 3D meeting 
room media window 50. The representative can be an actual person with 
his own avatar or an intelligent agent avatar. As on a physical 
exhibition stand, multiple users 17 can be present with one or more 
representatives of the organisation in the same exhibition 3D meeting 
room media window 50.

Whilst browsing in the 3D meeting room media window 50, a user 17 may:

- see objects representing products 286
- navigate by pressing on the object 286 or pressing buttons on the
  navigation bar 287
- view it by pressing a button 288. The user's avatar can pick the
  product 286 up and turn it around if the product is of a suitable
  size. Alternatively, the product can be rotated by the user
- buy it by pressing a button 285
- be taken on a tour around the company's products 286 by an
  intelligent agent avatar 5 if the user 17 presses the button 289.

This embodiment is not limited to the functions disclosed here but
covers any function from an actual exhibition that can be implemented
virtually.

There are many advantages of virtual avatar exhibitions as disclosed
in this embodiment including:

- interactivity with the salesman using voice to communicate
- ability to browse a 3D virtual stand without being approached by a
  salesman
- not having to spend time and incur cost travelling to actual
  physical exhibition locations
- not missing an exhibition due to a schedule clash

It is a purpose of this seventh embodiment to disclose a process
wherein users communicate in virtual exhibition means comprising the
following steps:

- a user navigates in a virtual exhibition stand of a company;
- the user views and interacts with virtual objects representing
  products;
- optionally the user communicates remotely with a real sales
  representative;
- optionally the user communicates with an intelligent agent avatar;
- optionally the user views presentations;
- optionally the user buys the product.

EIGHTH EMBODIMENT 

It is a purpose of this eighth embodiment that an avatar agent 5 is
driven by an intelligent agent and not by the user 17.

Avatar agents

Avatar agents are photo-realistic avatars driven by intelligent
software agents rather than people. An avatar user interface system
that will provide the benefits of avatar agents to people does not
exist.

Intelligent agent 

Figure 46 is a block diagram of an avatar agent hosting system and 
intelligent agent software in accordance with an eighth embodiment of 
the present invention. It shows intelligent agent software unit 320 
on an avatar agent hosting server (AAHS) 321 running with AAHS 
management software 322 stored in memory 348 driving an avatar agent 5 
in an avatar user interface window 260 on the display device 264 of a 
personal computer 3. The AAHS management software 322 manages one or 
more intelligent agent software units 320 running concurrently on the
AAHS 321. Alternatively, in a second client-server system
architecture, the intelligent agent software unit 320 may be running
in memory 344 on the avatar hosting server (AHS) 4. Alternatively, in 
a peer to peer system architecture, the intelligent agent software 
unit 320 may be running in memory 345 on a personal computer 3. 

The identity 275 of the avatar agent 5 is usually the same as for the 
intelligent agent software unit 320. The identities of the avatar 
agent 5 and the intelligent agent software unit 320 could also be 
different, in which case the avatar agent identity number would have
to indicate on which avatar agent hosting service the avatar agent is
resident.

The intelligent agent software unit 320 can perform synchronously or
asynchronously. It can communicate by outputting marked-up text 327
or audio voice 185. It contains artificial intelligence software 323
and a database of knowledge 324. It may also have access to further
databases of knowledge 324 via the network 2. For voice communication
it includes a speech recognition engine 182 and an agent text to
speech engine 326.

The intelligent agent software unit 320 can generate events 81 that 
are incorporated as mark-ups in the marked-up text 327 or output from
the agent text to speech system 326. The events 81 that go to the
software director 80 cover such aspects as emotions and gestures. 



The actions 83 of an avatar agent 5 usually exhibit better
anima-realism than the actions 83 of an avatar 5 driven from voice 185
because the intelligent agent software unit 320 has more knowledge for 
generating events 81 than can be extracted from analysis of the live 
voice stream 185 of a user 17. 

The intelligent agent software unit 320 can represent itself visually
with an avatar 5 that does not have the identity of a real person.
Each avatar agent 5 is driven by one intelligent agent software unit
320. The avatar agent 5 of the intelligent agent software unit 320 may
be a parameter avatar 232, or it may be edited to look like a
photo-realistic avatar 238 of a real person, or it may be based on
images taken of a real person without that person's identity being
associated with it.

The intelligent agent software unit 320 speaks through an agent text
to speech engine 326 using impersonation parameters 325 that make the
voice 185 emit a characteristic profile. An example of a
characteristic voice profile 184 is a middle-aged Scottish woman. In
that case the avatar agent 5 impersonates a middle-aged Scottish
woman. The impersonation parameters 325 are of two types: voice
impersonation parameters 331 and action impersonation parameters 332.

Agent impersonation of a person 

The agent avatar 5 can represent a real person 17 and use the
photo-realistic avatar 238 of that real person 17. The impersonation
parameters 325 can be the personalised voice profile of that
particular person 17. In this way the avatar agent 5 can represent the
real person 17 by looking like that person and sounding like that
person whilst that real person is unavailable.

Generating impersonation parameters

Figure 47 is a block diagram of an apparatus for generating
impersonation parameters. To obtain a high quality of recording, a
person's 17 voice and movements may be recorded in a room 330
insulated from a noisy environmental location 273. Video 336 is
recorded from at least one camera 29 and audio 185 is recorded using a
microphone 12 of a person 17 reading known text 189 on a screen 264 of
a personal computer 3. The impersonation parameter generation
software 331 running in memory 345 on the personal computer 3
processes the video 336 and audio 185 to generate a set of
impersonation parameters 325. The impersonation parameters 325 are of
two types: voice impersonation parameters 331 and action impersonation
parameters 332. The voice impersonation parameters 331 are generated
by processing the audio of the known text 189. The action
impersonation parameters 332 are generated by processing the facial
movements as the words in the known text 189 are spoken and as
emotions are used.



Using impersonation parameters

The intelligent agent software unit 320 generates marked-up text 327.
The marked-up text 327 is processed by the agent text to speech engine
326 using the voice impersonation parameters 331 of the person 17 to
modify an existing speech database 328 by speech synthesis. The voice
185 emitted by the agent text to speech engine 326 sounds like that of
the person 17. The marked-up events in the marked-up text 327 are
modified by the action impersonation parameters 332 to produce gesture
action events 81 that use characteristic gestures that the person 17
normally uses when speaking.

The simplest application of avatar agent impersonation of a real
person would be in a personalised answer-phone application. The
intelligent agent software unit 320 may know who is calling and their
access level to the real person's information. It may also know what
activity the real person is currently involved in and when it is due
to finish. It answers the avatar call with an appropriately
personalised message. For example: "Hi John, I'm in an avatar session
until 11.00, please leave an avatar-mail." The caller, John, will
recognise the voice and see the avatar as if it were the real person
he was calling. In applications requiring more intelligence from the
intelligent agent software unit 320 such as a personal assistant
application, more advanced bi-directional communications can take
place in which, for example, arrangements can be 'pencilled in'
involving diaries.
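
By way of illustration only, the personalised answer-phone message described above could be composed as follows. The `Status` record, the `may_see_activity` flag standing in for the caller's access level, and the fallback wording for callers without access are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Status:
    activity: str           # what the real person is doing, e.g. "an avatar session"
    ends_at: Optional[str]  # when the activity is due to finish, e.g. "11.00"


def answer_message(caller_name: Optional[str], status: Status, may_see_activity: bool) -> str:
    """Compose the message spoken by the avatar agent when answering a call."""
    greeting = f"Hi {caller_name}," if caller_name else "Hi,"
    if may_see_activity and status.ends_at:
        reason = f"I'm in {status.activity} until {status.ends_at},"
    elif may_see_activity:
        reason = f"I'm in {status.activity},"
    else:
        # Assumed behaviour: callers without access are not told the activity.
        reason = "I'm unavailable at the moment,"
    return f"{greeting} {reason} please leave an avatar-mail."


if __name__ == "__main__":
    print(answer_message("John", Status("an avatar session", "11.00"), True))
```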



Avatar agents in this embodiment of the avatar user interface system
invention will provide benefits to people in a wide range of
applications, including as:

- call centre personnel: users can interact with an organisation via
  virtual agents instead of expensive call centre personnel; fields
  include: account payments, technical support, changing service
  levels
- sales representatives: users can discuss potential purchases with
  virtual agent sales representatives
- real estate: home purchasers can be shown round virtual 3D replicas
  of homes on the market by a virtual real estate agent
- entertainers: avatar agents will become performers in shows
  customised to the user's desires; what was a child's television
  programme becomes an interactive, personalised entertainment led by
  an avatar agent
- advisers: people can consult an avatar agent specialist for advice;
  fields include: independent financial advice, style of clothing,
  selection of make-up, dieting, fitness, sports, psychology,
  psychiatry, cooking
- newscasters: virtual newscasters will be able to read the news that
  you want when you want
- housekeepers: virtual agents will provide: management of the home
  network, automatic call out of home service personnel such as
  heating system technicians, automatic reordering of home
  consumables such as lavatory rolls, entry to trusted persons,
  interfacing with home residents via an avatar user interface system
- personal assistants at work: a virtual agent will become your
  personal assistant managing tasks including booking meetings,
  taking messages, making travel arrangements, carrying out research
- teachers: virtual tutors will time and pace e-learning courses to
  suit you and maximise your rate of skill acquisition; fields
  include education at all levels, music in which the avatar agent
  music teacher records and analyses a student's playing of an
  instrument, languages in which the avatar agent language teacher
  can correct pronunciation and lead a discussion in the foreign
  language
- representatives of yourself: an avatar agent can represent you when
  you are off-line, participating in interactions with people and
  other avatar agents; looking like you and sounding like you

Eventually, people will interact with avatar agents as they do with
other people. It will be very difficult to distinguish between avatar
agents and avatars driven by people. Where such communication is
remote (not face to face), it is likely that the same interface will
be used for conversing with people as for conversing with agents.

It is a purpose of this eighth embodiment to disclose a process 
wherein generic action impersonation parameters are defined for a 
communication context comprising the following steps: 

- recording a corpus of videos of the communication context;
- processing the corpus by a trained person along a timeline to
  produce an annotated timeline with actions of each communication
  context participant related to a number of parameters;
- analysing the annotated timeline by a trained person to produce a
  type definition of each action impersonation parameter and a set of
  rules that can be incorporated into a finite state machine for the
  communication context.
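
By way of illustration only, a set of rules of the kind produced by the final step could be held in a finite state machine as a table mapping a state and an event to a next state and an action. The states, events and actions shown here are invented examples for a conversation context; the disclosure does not enumerate them.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Rule = Tuple[str, Optional[str]]  # (next_state, action or None)

# Hypothetical rules for a two-party conversation context.
RULES: Dict[Tuple[str, str], Rule] = {
    ("listening", "other_speaks"): ("listening", "nod"),
    ("listening", "start_speaking"): ("speaking", "raise_eyebrows"),
    ("speaking", "emphasis"): ("speaking", "hand_gesture"),
    ("speaking", "stop_speaking"): ("listening", "look_at_listener"),
}


def run(rules: Dict[Tuple[str, str], Rule], start_state: str,
        events: Iterable[str]) -> Tuple[str, List[str]]:
    """Feed events through the state machine and collect the actions produced."""
    state = start_state
    actions: List[str] = []
    for event in events:
        # A (state, event) pair without a rule leaves the state unchanged
        # and produces no action.
        next_state, action = rules.get((state, event), (state, None))
        if action is not None:
            actions.append(action)
        state = next_state
    return state, actions
```

Keeping the rules in a table rather than in code means a different communication context can be supported by supplying a different table.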

It is a further purpose of this eighth embodiment to disclose a 
process wherein personal action impersonation parameters for a 
particular person are generated using an action impersonation 
generator/editor means involving manual input by a user comprising the 
following steps: 

- in the first step, the user makes selections from a number of sets
  of generic action impersonation parameters at a high level;
- in the second step, the user edits the selections at a lower level;

wherein the second step is optional and the user may or may not be the
person for whom the personal action impersonation parameters are
generated.

It is a further purpose of this eighth embodiment to disclose a
process wherein personal action impersonation parameters for a
particular person are generated automatically using an action
impersonation generator/editor means comprising the following steps:

- in the first step, video recordings are made of the person
  carrying out a number of defined actions;
- in the second step, the action impersonation generator/editor
  automatically analyses the video recordings to generate a set of
  personal action impersonation parameters.

It is a further purpose of this eighth embodiment to disclose a 
process wherein a software director uses voice impersonation 
parameters defined for an avatar to generate speech from text using 
text to speech engine means for an avatar such that the avatar speaks 
recognisably like the person it represents comprising the following 
steps:

- intelligent agent software unit means generates the text;
- text to speech engine means converts the text to speech;
- the speech is played on the computing appliance.

It is a further purpose of this eighth embodiment to disclose a 
process wherein voice impersonation parameters are defined for an 
avatar of a particular person comprising the following steps:

- recording the person speaking predefined text;
- processing the recording using impersonation parameter generation
  software;
- the impersonation parameter generation software outputting the
  voice impersonation parameters for that person;
- storing the voice impersonation parameters in the avatar.

It is a further purpose of this eighth embodiment to disclose a 
process wherein after a voice communication by a user, a speech 
recognition engine means processes the voice communication comprising 
the following steps: 

- a user generates a voice communication by speaking;
- a speech recognition means processes the voice communication and
  outputs text;
- the text is sent to any intelligent agent software units involved
  in the session.



It is a further purpose of this eighth embodiment to disclose a
process wherein a user speaks in a first language and an intelligent
agent software unit operates in a second language such that text is
translated by translation engine means comprising the following steps:

- a user generates a voice communication by speaking in a first
  language;
- a speech recognition means that operates in the first language
  processes the voice communication in the first language and outputs
  text in the first language;
- the text in the first language is translated by translation engine
  means into text in a second language;
- text in the second language is sent to any intelligent agent
  software units involved in the session capable of processing text
  in the second language.
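
By way of illustration only, the steps above can be sketched as a small pipeline in which the speech recognition means and the translation engine means are supplied as functions. No particular recognition or translation engine is implied; `AgentUnit`, `route_utterance` and the callable signatures are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Recogniser = Callable[[bytes, str], str]    # (audio, language) -> text
Translator = Callable[[str, str, str], str]  # (text, source, target) -> text


@dataclass
class AgentUnit:
    name: str
    language: str  # language in which this agent can process text
    inbox: List[str] = field(default_factory=list)


def route_utterance(audio: bytes, recognise: Recogniser, translate: Translator,
                    agents: List[AgentUnit], first_language: str,
                    second_language: str) -> List[AgentUnit]:
    """Recognise, translate and deliver one utterance; return the recipients."""
    # Speech recognition operates in the user's (first) language.
    text_first = recognise(audio, first_language)
    # The translation engine produces text in the agents' (second) language.
    text_second = translate(text_first, first_language, second_language)
    # Only agents capable of processing the second language receive the text.
    recipients = [agent for agent in agents if agent.language == second_language]
    for agent in recipients:
        agent.inbox.append(text_second)
    return recipients
```

Passing the engines in as functions keeps the sketch independent of where they run, whether on the personal computer or on a server.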

It is a further purpose of this eighth embodiment to disclose a
process wherein a user understands a first language and an intelligent
agent software unit operates in a second language such that text is
translated by translation engine means comprising the following steps:

- an intelligent agent software unit generates text in a first
  language;
- the text in the first language is translated by translation engine
  means into text in a second language;
- text to speech engine means converts the text in the second
  language to speech in the second language;
- the speech in the second language is played to the user using
  loudspeaker means.

NINTH EMBODIMENT 

It is a purpose of this ninth embodiment that the avatar user
interface system 261 may be used for biometric security applications
at locations such as airports or military installations.

There is an increasing need for security systems based on biometric
identification at airports to combat terrorism and in many other
security applications. Currently, a biometric security system based
on photo-realistic avatars of people does not exist.

Biometric security 

Figure 48 is a block diagram of the avatar user interface system with
extended security functionality in accordance with a ninth embodiment
of the present invention. It shows a person 313 passing a security
checkpoint 314. The person's identity 275 is contained on an identity
source 310 such as a smart card carried by the person 313 or an
implant in the person 313. The person's identity 275 is read from the
identity source 310 by an identity source reader 311 attached to a
personal computer 3. Identity processing software 312 in memory 345
on the personal computer 3 calls up the avatar 5 corresponding to the
identity 275 from the avatar hosting service 4 over the network 2.
The avatar 5 corresponding to the identity 275 is displayed in the
avatar user interface window 260 on the display device 264 of the
personal computer 3. A security user 17, who is usually a security
guard, can visually compare the person 313 to the avatar 5
corresponding to the identity 275 presented by the person 313. If the
person 313 and the avatar 5 are not similar then the security user 17
can stop the person 313 for questioning. To enhance security, the
network 2 and avatar hosting service 4 may be private to the
organisation conducting the security check.

In a more automated method of conducting security checks, a camera 29
attached to the personal computer 3 takes images 19 of the person 313.
Image processing and comparison software 315 in memory 345 on the
personal computer 3 can automatically compare the images 19 to the
avatar 5 corresponding to the identity 275 presented by the person
313. If the image processing and comparison software 315 finds a
significant discrepancy between the images 19 and the avatar 5
corresponding to the identity 275 presented by the person 313, then
the security user 17 is alerted. Image processing and comparison
software 315 is well known to those skilled in the art; however,
increasing the accuracy of such software is still a research area.

Where a discrepancy is found, the images 19 may be sent to remote
image processing and comparison software 315 in memory 344 on the
avatar hosting server 4, which will compare the images 19 with a
database 318 of images of people who are known because they are a
security risk, or employees, or for any other reason. If the remote
image processing and comparison software 315 on the avatar hosting
server 4 makes one or more possible matches then these possible
matches are communicated to the security user 17 via the avatar user
interface window 260 on the display device 264. The remote image
processing and comparison software 315 and database 318 are not
limited to being resident on the avatar hosting server 4 but may be
resident on any server accessible via at least the network 2.

An intelligent agent software unit 320 resident in memory 345 on the 
personal computer 3 represented by an avatar 5 in the avatar user 
interface window 260 on the display device 264 communicates with the 
security user 17. The intelligent agent software unit 320 may 
generate communications to the security user 17 relating to the 
advisable actions to be taken depending on the results of any 
comparisons made by the image processing and comparison software 315. 
The intelligent agent software unit 320 may respond to communications 
from the security user 17 such as requests for further comparisons. 

Certainty in verifying identity can be increased by combining the 
results of two or more biometric devices. In addition to the camera 
29 comparing to the avatar 5, a biometric device 316 connected to the 
personal computer 3 could measure another part of the person and 
compare it to reference biometric data 317 linked with the avatar 5 
via the identity 275. Typical biometric devices include fingerprint 
scanning, iris scanning, hand scanning, face recognition and voice 
pattern recognition. 

It is a purpose of this ninth embodiment to disclose a security
process comprising the following steps:

- a person providing an identity source that is read by an identity
  source reader;
- retrieving the avatar whose identity matches the identity in the
  identity source;
- displaying the avatar;
- a security user visually comparing the avatar with the person.



It is a further purpose of this ninth embodiment to disclose a largely 
automated security process comprising the following steps: 
- the person providing an identity source that is read by an identity 
  source reader; 
- retrieving the avatar whose identity matches the identity in the 
  identity source; 
- extracting the avatar biometric data from the avatar; 
- a biometric device scanning part of the person to provide scanned 
  biometric data; 
- comparing the scanned biometric data with the avatar biometric 
  data; 
- if the scanned biometric data does not match the avatar biometric 
  data then alerting the security user; 
- displaying the avatar to the alerted security user; 
- the alerted security user visually comparing the avatar with the 
  person. 
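
By way of illustration only, the largely automated process above might 
be sketched as follows. The names Avatar, automated_check and 
tolerance are hypothetical, a single number stands in for real 
biometric data, and the "unknown" result for an identity with no 
avatar is an assumption that the disclosure does not address. 

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Avatar:
    identity: str
    # Hypothetical: one number stands in for the avatar biometric data.
    biometric_template: float


def automated_check(
    identity: str,
    scanned_value: float,
    avatars: Dict[str, Avatar],
    alert: Callable[[Avatar], None],
    tolerance: float = 0.05,  # assumed matching tolerance
) -> str:
    """Return 'pass', 'alert' or 'unknown' for one person at a checkpoint."""
    # Retrieve the avatar whose identity matches the identity source.
    avatar: Optional[Avatar] = avatars.get(identity)
    if avatar is None:
        return "unknown"
    # Compare the scanned biometric data with the avatar biometric data.
    if abs(scanned_value - avatar.biometric_template) <= tolerance:
        return "pass"
    # Mismatch: alert the security user and hand over the avatar to display.
    alert(avatar)
    return "alert"
```

In this sketch the security user is only involved on a mismatch, at 
which point the avatar is passed to the alert callback for display 
and visual comparison with the person. 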

This embodiment is not limited to the disclosure provided. For 
example, the intelligent agent software unit 320 might be resident on 
an avatar agent hosting server (AAHS) on the network 2 instead of on 
the personal computer 3. Two or more security checkpoints 314 at one 
location may be connected to a single personal computer 3. In a large 
establishment, multiple security checkpoints at multiple locations may 
be wired to multiple personal computers 3 in one or more security 
rooms monitored by multiple security users 17. 

This embodiment of the avatar user interface system invention has 
significant utility. It can support a security guard in making a 
quick visual verification that the person showing an identity is 
actually the person to whom the identity belongs. In a more automated 
form, it can alert a security guard when a discrepancy between the 
person going through a security checkpoint and his avatar is detected. 

TENTH EMBODIMENT 

It is a purpose of this tenth embodiment that the avatar user 
interface system 261 is used for interactive computer games. 

On-line interactive computer games do not exist where the user is 
represented by a photo-realistic avatar 238 of himself. On-line 
interactive computer games do not exist where avatars 5 can exhibit 
lip synchronised animation in real-time to voice transmitted over a 
network 2. 

Games 

Figure 49 is a block diagram of an avatar user interface system 261 
for interactive computer gaming in accordance with a tenth embodiment 
of the present invention. Users 17 interact with their personal 
computers 3. A session server 1 handles the voice mixing between 
users 17. An avatar hosting server 4 hosts the avatars 5 of the users 
17 which are sent to the personal computers 3. A Game Hosting Server 
370 hosts the game software 371, the state 372 of the game and billing 
software 237 in memory 374. A network 2 connects the servers and 
personal computers. If the game involves avatar agents, one or more 
avatar agent hosting servers 321 may serve the avatar agents 5. 

Special game interface equipment 373, containing sensors to detect 
the movements of the user and feedback devices to stimulate the 
user's senses, may be attached to the personal computer 3. 

The computer games industry is clearly structured into a number of 
game genres such as role-playing games (RPG), sports including 
football and wrestling, car racing, God simulations, strategy games, 
board games and first-person fighting. 

Some of these genres have found a place in on-line gaming in which 
users play the game with each other over a network. The game is 
hosted on a game server that is also on the network. 

It is a purpose of this avatar user interface system invention that a 
new genre of communicative avatar on-line game, not possible before, 
may be built using the avatar user interface system 261. Users 17 see 
each other in the virtual environment of the game software 371 as 
photo-realistic avatars 5. During the game, users 17 can communicate 
with each other by voice as if they were in the same room. 

In some game designs, the user 17 may navigate his avatar 5 through 
the virtual environment of the game using normal personal computer 
input devices such as mouse and keyboard. Examples of this are 
environments where users navigate towards other people's avatars to 
meet them. 

In other game designs, the actions of the user's avatar are generated 
by the software director 80 in reaction to events. As an example, in 
a game of chess, the user 17 may move one of his chess pieces from 
one square to another and the software director 80 will show the 
user's avatar 5 picking up a piece and moving it from one square to 
another. 

In an on-line game 371 with a shared virtual environment, the state 
372 of the game must be maintained on the game hosting server 370. In 
this way, the shared virtual environment of the game 371 is the same 
for all users 17 at all times because there is only one state. The 
only time that there are differences is when there are delays or lags 
on the network 2. Whilst waiting for the new state 372 to be 
synchronised over the network 2, the software director 80 may play 
actions in anticipation of what will happen in the game 371. Any 
state discrepancies that this causes can be quickly corrected by the 
software director 80 when the new state 372 arrives at the personal 
computer 3. 
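
By way of illustration only, the anticipation and correction 
described above might be sketched as follows. The class ClientView 
and its method names are hypothetical, and representing the state as 
a mapping from piece names to board squares is an assumption made for 
the chess example; the disclosure does not specify how the state 372 
is represented. 

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Square = Tuple[int, int]


@dataclass
class ClientView:
    """What one personal computer shows, versus what the server confirmed."""
    confirmed: Dict[str, Square] = field(default_factory=dict)
    shown: Dict[str, Square] = field(default_factory=dict)

    def play_local_action(self, piece: str, target: Square) -> None:
        # Play the action in anticipation, before the server confirms it.
        self.shown[piece] = target

    def on_server_state(self, state: Dict[str, Square]) -> Dict[str, Square]:
        # The single authoritative state arrives: adopt it and report
        # every piece whose displayed position had to be corrected.
        corrected = {p: sq for p, sq in state.items()
                     if self.shown.get(p) != sq}
        self.confirmed = dict(state)
        self.shown = dict(state)
        return corrected
```

If the anticipated action agrees with the state held on the server, 
nothing is corrected; otherwise the displayed positions snap back to 
the server's state when it arrives. 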

It is readily apparent that this embodiment has advantages when, 
during an online interactive computer game, photo-realistic avatars 
238 can be viewed, avatars 5 can be lip synchronised, or both. The 
advantages include combined audio and visual recognition and 
suspension of disbelief such that the user finds the game compelling 
and the sessions are longer. 

It is a purpose of this tenth embodiment to disclose a process wherein 
users communicate in an interactive game hosted on a game hosting 
server comprising the following steps: 
- a first user interacts with the game, navigates around the 3D game 
  scene and views the avatar of a second user; 
- the second user interacts with the game, navigates around the 3D 
  game scene and views the avatar of the first user; 
- the first user communicates by speaking; 
- the second user hears the first user and views the avatar of the 
  first user in lip synchronisation with the first user's speech; 
- the second user communicates by speaking; 
- the first user hears the second user and views the avatar of the 
  second user in lip synchronisation with the second user's speech. 

ELEVENTH EMBODIMENT 

It is a purpose of this eleventh embodiment that the avatar user 
interface system 261 is used in immersive virtual reality (VR) 
environments. 

There is a variety of immersive VR systems. These include but are not 
limited to VR headsets and caves. A person can wear a VR headset in 
which his view of the physical environmental location 273 is replaced 
by viewing a display apparatus on which the virtual environment is 
displayed. A person can enter into a cave, which can be generally 
defined as a partially or fully enclosed physical space in which the 
display area is large. The person sees a virtual environment that has 
been projected onto the walls, floor and ceiling of the room. 

An immersive VR system does not exist in which people are represented 
by photo-realistic avatars of themselves. Nor does an immersive VR 
system exist where people's movements can be motion tracked and used 
to drive photo-realistic avatars of themselves in other locations. 

Immersive VR 

Figure 50 is a schematic of an avatar user interface system 261 for a 
six-sided cave 350 in accordance with an eleventh embodiment of the 
present invention. The six faces of the cave 350 are illuminated by 
six back projectors 352. The six back projectors 352 are connected by 
one or more cables 354 to a computer 355. The computer 355 contains 
an avatar player engine 210 and a cave display system 357 in memory 
345 and is connected to a network 2. The avatar player engine 210 
generates a 3D virtual environment 356 containing avatars 5 that 
usually changes over time. The avatar player engine 210 transmits the 
3D virtual environment 356 to the cave display system 357. The cave 
display system 357 generates the digital projector images 353 and 
transmits them to the back projectors 352. The six faces of the cave 
350 are fabricated from a material that permits back projection such 
that the six projected images 353 are visible to the user 17 from 
inside the cave 350. Each projected image 353 is a sequential stereo 
pair from which a 3D effect can be experienced. A user 17 wearing 
shutter glasses 351 is inside the cave 350. The shutter glasses 351 
combine the stereo pair image 353 displayed onto each wall to form a 
3D virtual environment 356. The 3D virtual environment 356 appears to 
stretch from right next to the user 17 to many hundreds of metres 
away. The experience is vivid and a strong sense of presence in the 
virtual world is experienced by the user 17. The user 17 sees a 3D 
virtual environment 356 with an avatar 5. The user 17 can see an 
avatar 5 in 3D when the user 17 is facing in the direction of the 
avatar 5. If the avatar 5 is central in the cave 350, the user 17 can 
walk through or around the avatar 5, turn and see the avatar 5 from a 
different viewpoint. The avatar 5 can move in the virtual environment 
356 relative to the user 17. 

It is often possible for several people to be in a cave 
simultaneously. In a cave 350, the user 17 can see parts of himself 
such as his legs 358 and arms 359, the people with him in the cave, 
the virtual environment 356 and the avatars 5. 

This eleventh embodiment is not limited to a cave with six sides; a 
physical space with a display of one or more sides can be used. 
Conventional displays such as a monitor or plasma screen can be used. 
Shutter glasses are one method of converting images into a 3D 
environment, but this invention can incorporate a wide range of 3D 
display technologies. 

Networked VR 

One or more users 17 in a cave 350 at a first location can be 
connected via a network 2 to one or more other users 17 in another 
cave 350 at a second location such that all the users 17 appear as 
avatars 5 immersed in the same 3D virtual environment 356. It is 
advantageous for users at one cave location to see the movements of 
the users at the other cave location. For this, a motion capture 
system is required at each location. 

Figure 51 is a schematic of an avatar user interface system 261 for 
two caves 350 connected by a network 2. A motion capture system 368 
is integrated into the cave 350 at location 1. The motion capture 
system 368 comprises four cameras 362 viewing the internal area of the 
cave 350 connected by a cable network 365 to a computer 363 running 
motion capture software 364 in memory 345. The user 17 wears a suit 
360 to which infra-red emitters 361 are attached. As the user 17 
moves around in the cave 350, the motion capture software 364 on the 
computer 363 calculates the motion 369 of the user 17. The motion 369 
is sent to the cave 350 at location 2 and the motion 369 is played on 
the photo-realistic avatar 5 of the user 17. 

There are many types of motion capture system and this invention is 
not limited to the type disclosed. For example, the motion capture 
system might be passive and not require the user to wear a suit with 
active emitters. 

As an alternative to a cave, a user 367 may wear a VR headset 366 
whilst moving inside the motion capture system 368. 

As an alternative to cabling 354 and/or 365, wireless networks could 
be used. 

Some of the avatars 5 might be avatar agents 5 driven by an 
intelligent agent software unit 320 rather than by users 17. In this 
way, agents and users can mingle and interact in a 3D virtual 
environment 356 without it being immediately obvious whether an 
avatar 5 is driven by an agent or a user. 

This invention is not limited to participants in just two locations 
being immersed in the same 3D virtual environment 356. Three or more 
locations may be used. At each location, there could be a cave or the 
user could use a VR headset. 

At some locations there may not be motion capture, so that the user 
could not be seen to move around. Where the user is viewing only, he 
could either be invisible to the other users or visualised in a fixed 
position either standing or sitting. In this instance, the user could 
use an avatar user interface as disclosed in the First Embodiment. 

During networked immersion, each user 17 can wear a headset 11 for 
audio communication with the other participants. 

There are significant advantages of this embodiment of the avatar user 
interface system. Fundamentally, this embodiment discloses means by 
which a highly realistic immersive VR experience can be achieved, 
thereby giving users a high sense of presence in the session. 
Applications of this eleventh embodiment cover entertainment, 
communication and collaborative work; but this embodiment is not 
limited to these applications. 

It is a purpose of this eleventh embodiment to disclose a process 
wherein users are present in Cave means with motion capture systems 
means comprising the following steps: 
- the motion capture system means records movements of a first user 
  in a first Cave means; 
- the recorded movements are sent with acceptable lag from the first 
  Cave means to a second Cave means; 
- an avatar of the first user is displayed in the second Cave means 
  such that the movements of the avatar duplicate the movements of 
  the user in space; 
- a second user wearing shutter glasses or similar immersive 3D 
  viewing means in the second Cave means views the movements of the 
  avatar of the first user as if the first user were physically in 
  the second Cave with the second user. 
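
By way of illustration only, sending recorded movements from the 
first Cave means and playing them on the avatar in the second Cave 
means might be sketched as follows. The function names, the JSON 
message layout, the 0.2 second lag limit and the choice to discard 
frames that arrive too late are all assumptions; the disclosure 
requires only that the lag be acceptable. 

```python
import json
from typing import Dict, List, Tuple

Joint = Tuple[float, float, float]  # x, y, z position of one tracked point


def encode_frame(user_id: str, timestamp: float,
                 joints: Dict[str, Joint]) -> str:
    """Serialise one captured motion frame for sending over the network."""
    return json.dumps({"user": user_id, "t": timestamp, "joints": joints})


def apply_frame(message: str, now: float,
                avatar_pose: Dict[str, List[float]],
                max_lag: float = 0.2) -> bool:
    """Play a received frame on the avatar unless it arrived too late."""
    frame = json.loads(message)
    if now - frame["t"] > max_lag:
        return False  # stale frame: keep the avatar's current pose
    avatar_pose.update(frame["joints"])  # JSON delivers joints as lists
    return True
```

The sketch assumes that the clocks at the two locations are 
synchronised closely enough for the timestamps to be compared. 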

TWELFTH EMBODIMENT 

It is a purpose of this twelfth embodiment that the avatar user 
interface system 261 may be connected to exercise equipment. 

Health and training 

There is an increasing consumer demand for healthier lifestyles. 
People buy exercise equipment such as rowing and running machines for 
use at home. However, many people lack the motivation to use them on 
their own. 

Figure 52 is a schematic of an avatar user interface system 261 
comprising two exercise stations 414 connected together by a network 
2. An exercise station 414 comprises a piece of exercise equipment 
410, used by a user 17 wearing a pulse rate gauge 415, with a 
processor 411 connected by a cable 413 to a personal computer 3. The 
personal computer 3 runs exercise equipment interface software 412 
and avatar user interface software 262 in memory 345 and has a 
display device 264 showing an avatar user interface 260 viewed by the 
user 17. 

Many items of exercise equipment come with a built-in processor and a 
connection to a personal computer such that the personal computer can 
monitor and/or control the exercise equipment. Examples of parameters 
monitored from the exercise equipment might be speed, strength 
setting, energy dissipation rate, user pulse rate and cumulative 
energy dissipated. An example of a parameter that might be controlled 
is the strength setting of the exercise equipment. 

The two users can share their exercise as a social experience. A 
first user 17 can see the avatar 5 of the second user 17 in his avatar 
user interface 260 in a scene showing the avatar 5 of the second user 
using virtual exercise equipment 410. As the exercise equipment 410 
of the second user 17 is moved by that user, the exercise equipment 
interface software 412 monitors the movements of the exercise 
equipment 410 and sends them over the network 2 to the avatar user 
interface software 262 on the personal computer of the first user 17. 
In this way, the first user sees the avatar 5 of the second user 
moving on the virtual exercise equipment in the avatar user interface 
260 in substantially real time compared to the actual movements of the 
second user. If the second user stops using the exercise equipment, 
then the first user will see almost immediately that the avatar of the 
second user has stopped using the exercise equipment. During the 
avatar user interface system interaction, the two users may talk to 
each other using the headsets 11. 

The scope of this twelfth embodiment is not limited to the disclosure 
above. As will be understood by persons skilled in the art, the 
connection between the exercise machine 410 and the personal computer 
could be wireless rather than a cable 413. The display device 264 and 
the personal computer 3 might be built into the exercise machine 410 
which connects to the network 2. The headset 11 may be connected to 
the personal computer 3 by wireless rather than a cable. Loudspeakers 
may be used instead of headphones. Other biometric devices may be 
worn by the user 17 in addition to the pulse rate monitor. The 
exercise equipment interface software 412 may correlate performance of 
each user over a number of sessions and generate statistical data to 
track increases in fitness. The wearing of a pulse rate gauge 415 is 
optional. 

This twelfth embodiment is not limited to two users; three or more 
users may be connected simultaneously. One user 17 may be a personal 
trainer for another user 17 and use the avatar user interface system 
to both monitor and encourage that other user. A personal trainer 
could train several users simultaneously. Users may compete against 
each other on certain parameters such as speed, strength and 
endurance. International virtual competitions may be held with their 
appearance in the avatar user interface system being similar to that 
of a televised sports event. A user 17 may be a medical doctor who 
can monitor remotely the health of a user 17 who is a patient. An 
avatar intelligent agent software unit 320 may take the role of a user 
or personal trainer or doctor or any other professional such as a 
sports therapist. 

Furthermore, this twelfth embodiment may be combined with features 
from the fourth embodiment such that two or more people at one 
location can exercise together whilst being in contact with one or 
more other people at one or more other locations. 

This twelfth embodiment of the avatar user interface system invention 
enables a person who is in one location to carry out a physical 
activity whilst in virtual contact with one or more people in other 
locations. Advantages of this embodiment include: time and cost saved 
travelling by each user to an agreed location where they can exercise 
together, increased motivation by exercising whilst in virtual contact 
and not needing to dress up to be seen in public. 

It is a purpose of this twelfth embodiment to disclose a process 
wherein users communicate whilst exercising on exercise station means 
comprising the following steps: 
- a first user using a first exercise station means; 
- a second user using a second exercise station means; 
- the first user viewing the avatar of the second user using a 
  virtual exercise station; 
- the second user viewing the avatar of the first user using a 
  virtual exercise station; 
- the first and second users communicating by voice; 
- optionally the first and second users viewing performance data 
  generated by the first and second exercise station means; 
- optionally any user being able to see if the other user has stopped 
  exercising. 
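
By way of illustration only, the way one user's station might decide 
whether the other user's avatar should be shown exercising or stopped 
can be sketched as follows. The names EquipmentSample, has_stopped 
and avatar_animation are hypothetical, and treating a gap of more 
than 2 seconds without a sample as "stopped" is an assumption. 

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EquipmentSample:
    """One reading sent over the network by the other exercise station."""
    timestamp: float                   # seconds
    speed: float                       # e.g. strokes or steps per minute
    pulse_rate: Optional[int] = None   # the pulse rate gauge is optional


def has_stopped(latest: EquipmentSample, now: float,
                idle_after: float = 2.0) -> bool:
    # Stopped if the equipment reports no movement, or if no fresh
    # sample has arrived within the assumed idle window.
    return latest.speed == 0.0 or (now - latest.timestamp) > idle_after


def avatar_animation(latest: EquipmentSample, now: float) -> str:
    """Choose the animation for the other user's avatar."""
    return "idle" if has_stopped(latest, now) else "exercising"
```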

THIRTEENTH EMBODIMENT 

It is a purpose of this thirteenth embodiment that the avatar user 
interface system 261 may be used for practicing and planning. 

Currently, no avatar user interface system means exist for a person to 
practice or plan something virtually, either with other people or with 
avatar agents or both. 

Practicing might cover exercises for learning a new skill, preparing 
for delivery of an event or planning an event. Examples of 
applications that require practicing include: language learning, 
learning touch typing, delivering a presentation, public speaking, 
playing music, rehearsing a play, overcoming a fear such as that of 
public speaking by practicing in a virtual environment, planning the 
choreography of a ballet and planning the direction of an event. 

Practicing using the avatar user interface system 261 involves the 
person practicing providing input to the avatar user interface 
system 261 by means of voice, camera, keyboard, mouse or other 
specialised peripheral. This input may be fed to another person or an 
agent where it is processed and feedback is given to the person 
practicing. Feedback may be verbal or visual. Emote keys may be used 
by the person feeding back such that that person's avatar can visually 
show pleasure, displeasure, comprehension, confusion and other 
emotions. 

In the case of planning, a person planning will create a plan. This 
can be done collaboratively with others in synchronous or asynchronous 
ways. Synchronous planning will involve real-time interactions 
between users. Asynchronous planning might involve one person 
creating a plan such as a choreography for a ballet and others feeding 
back at a later time. In this case, a set of tools and props will 
usually be required for the application being planned. 



FOURTEENTH EMBODIMENT 

It is a purpose of this fourteenth embodiment that the avatar user 
interface system 261 has an Avatar Virtual Environment (AVE) as the 
background to the display device 264 and that the desktop 423 is 
present and usable on a virtual computing appliance 421 within the 
AVE. 

Figure 53 is a schematic of the display 264 of an avatar user 
interface system 261 with an avatar virtual environment (AVE) 420 as 
the background in accordance with this fourteenth embodiment. A 
virtual computing appliance 421 with a virtual computing appliance 
display 422 is present in the AVE 420. The desktop 423 of the PC 3 is 
shown on the virtual computing appliance display 422. The virtual 
computing appliance 421 is not always visible in the AVE 420 because 
visibility depends on whether it falls within the field of view of the 
virtual camera being used at that instant. 
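
By way of illustration only, a simplified test of whether the virtual 
computing appliance falls within the field of view of the virtual 
camera might be sketched as follows. The function name and the 60 
degree default are hypothetical, and the field of view is 
approximated as a circular cone around the viewing direction; aspect 
ratio, clipping planes and occlusion by other objects are ignored. 

```python
import math
from typing import Sequence


def in_field_of_view(camera_pos: Sequence[float],
                     camera_dir: Sequence[float],
                     target_pos: Sequence[float],
                     fov_degrees: float = 60.0) -> bool:
    """True if target_pos lies inside a cone of fov_degrees around camera_dir."""
    to_target = [t - c for t, c in zip(target_pos, camera_pos)]
    dist = math.sqrt(sum(v * v for v in to_target))
    dir_len = math.sqrt(sum(v * v for v in camera_dir))
    if dir_len == 0.0:
        raise ValueError("camera_dir must not be the zero vector")
    if dist == 0.0:
        return True  # target coincides with the camera position
    # Cosine of the angle between the viewing direction and the target.
    cos_angle = sum(a * b for a, b in zip(to_target, camera_dir)) / (dist * dir_len)
    return cos_angle >= math.cos(math.radians(fov_degrees / 2.0))
```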

Avatar Virtual Environment (AVE) 

Existing PC operating system user interfaces 20 are largely based upon 
the Windows concept such as the Microsoft Windows XP operating system. 
This user interface concept is referred to as the windows user 
interface. The windows user interface usually occupies the whole of 
the display area of the display device 264. The windows user 
interface usually consists of a desktop 423 background covering the 
whole display area and may have one or more windows open on top of the 
desktop 423. Any one window may be opened fully to cover the whole 
desktop 423. 

This avatar user interface system invention includes one or more 
windows containing an Avatar Virtual Environment (AVE) 420. An AVE is 
a photo-realistic virtual environment with photo-realistic avatars in 
it. The avatar user interface system of the First Embodiment uses 
avatar conference windows 23, 24 and 25, which are AVE windows, open 
in the context of the windows user interface. Controls such as 
control buttons 27 are situated outside the avatar conference window. 

Frames, presence and cognitive jolt 

A person often has several frames of interaction around him. These 
might include other people in the room, the computer's desktop 
display, various active applications on the display, music and a 
telephone. At any one time, the person's bandwidth of consciousness 
is spread between one or more frames. Although the term presence does 
not have a generally accepted definition, applications with high 
presence are applications in which the person tends to be very 
immersed such that awareness of other frames is only peripheral. 

Normally, there is a strong cognitive jolt in switching between 
frames. Someone may call you by name when you are playing a computer 
game and it may break your concentration with a jolt. The design of 
the windows user interface minimises cognitive jolt when moving 
between application windows. Currently, an AVE 420 operating as a 
window on a desktop 423 tends to result in a low feeling of presence 
because there is no meaningful metaphor between the AVE window and the 
neighbouring applications. There is also a high cognitive jolt when 
transferring between the AVE window and a neighbouring non-AVE desktop 
window. 



WO 03/058518 PCT/GB03/00031 

By placing the desktop 423 on a virtual computing appliance display 
422 within the AVE 420, there is a continuous metaphor and lower 
cognitive jolt as the user 17 transfers between the AVE frame and a 
desktop window frame. 

By inputting to the PC 3 with a user input device such as a keyboard 
14 or a mouse 15, the user 17 can move the virtual camera 71 such that 
the desktop 423 on the virtual computing appliance 421 is larger or 
smaller in the display device 264. The user 17 may also operate the 
desktop 423 on the virtual computing appliance 421 using input devices 
such as a keyboard 14 or a mouse 15. 

This fourteenth embodiment of the avatar user interface system 
invention enables a person to shift between frames with low cognitive 
jolt. Advantages of this embodiment include: improved communication, 
better task efficiency, a more suitable interface for multi-tasking 
between verbal tasks and information tasks and higher usability. 

Avatar agent sharing virtual computing appliance in AVE 

It is a further purpose of this fourteenth embodiment that an avatar 
agent and a user may communicate in an AVE with a virtual computing 
appliance in it; the virtual computing appliance may be used by the 
avatar agent to communicate information to the user and by the user to 
communicate information to the avatar agent. 

By way of disclosure of this fourteenth embodiment, a sample script is 
provided that might have been enacted between an avatar agent called 
Johan and a user using an AVE with a virtual laptop in it. The domain 
is the avatar agent giving professional advice to the user on risk 
management. 

'Johan is the Advisor. He is a slightly old-fashioned 'Mad 
Professor' character, dressed in an old-style suit and bow 
tie. His half-glasses are at the end of his nose. He is 
seen seated at a desktop with a laptop facing towards the 
SME user. Behind him, through the glass panels of the 
meeting room, the user sees a huge, high-tech data centre 
which gives the impression of vast knowledge. Johan speaks 
in a way that makes a bit of a caricature of himself. But 
he comes over as someone you would like and trust. In the 
Advice domain you can only talk to him; the mouse and 
keyboard are not used' 

The opening shot sets the scene: camera at first person 
point of view of SME user; Johan's head, upper body visible 
plus data centre in background; virtual laptop partially 
visible on table orientated partly towards SME user 

Johan "Hello, my name is Johan and I'm a most expert Risk 
Advisor. I'm a bit hard of hearing, it takes a while for 
me to work out what you say and I get easily confused. So 
please just answer my questions precisely and we'll be 
fine. If at any time I misunderstand you, just interrupt 
and say 'No, that's wrong'. Now, I'm a busy person so 
let's get on with it. First of all, which of these 
industries is your company working in?" 

Camera pans and zooms to virtual laptop screen on which you 
see a list of industries. 

Some seconds pass without the SME replying. Shot to Johan. 

Johan "Come on, surely you know what industry your company is in. 
Just read out the closest industry." 

SME "Woodworking" [Johan looks up in the air and thinks about 
this for a few seconds; this metaphor gives the intelligent 
agent time to plan a response] 

Johan "I suppose you are consulting me because your workshop 
caught fire recently." [Johan laughs] 
"Sorry, bad joke." [Slight pause, Johan leans forward] 
"Right, I've looked at all the claims that we have had in 
the woodworking industry and these appear to be the risks 
in the Woodworking industry." [Shot to virtual laptop, 
showing list of risks] 



"Please read each one. If it does not apply to you please 
say 'Not risk 7' or whatever the risk number is. OK off 
you go and say finished when you are done" 

SME [Several second pause] 
"Not Risk 9" 

Johan [Shot to Johan] 
"No IT risk - you mean you don't have many PCs. What's the 
next risk you don't have?" 

SME [Shot to laptop, short pause] 
"Finished" 

Johan [Shot to Johan] 
"Great. This is your company's risk profile." 
[Shot to laptop, risk severity/probability graph appears. 
Area under red line flashes] 
"The risks under the red line are so small you don't want 
to worry about them. You run a successful business, you 
can probably stand small losses like those" 
[Area above red line flashes] 
"But you want to worry about those risks towards the top 
right" 
[Worst risk flashes] 
"Especially that one! Industrial accident risk. Could be 
nasty. Claims of more than a million Euro are not 
uncommon." 

This embodiment is not limited to a single avatar agent; there may be 
a plurality of avatar agents interacting with the user. The virtual 
laptop is one example of a virtual computing appliance and other 
virtual computing appliances, such as virtual plasma screens, might be 
used in its stead. The user 17 may use input means to the AVE other 
than voice. Such means might include a keyboard 14 or mouse 15; input 
created with them would appear directly on the virtual computing 
appliance display 422 visible on the display device 264. 

FIFTEENTH EMBODIMENT 

It is a purpose of this fifteenth embodiment that the avatar user 
interface system 261 comprises motion capture means and software 
director means to improve the sense of co-presence during a 
communication session. 

Motion-tracking terminal 

Figure 54 is a schematic of a motion-tracking terminal 265 of an 
avatar user interface system 261 including motion-tracking cameras 29 
for a communication session. Three users 17 sit on chairs 174 around 
a table 172. At the end of the table 172 is a display device 264 with 
an AVE 420 displayed. The AVE 420 is displayed in such a way that the 
virtual table 51 in the AVE 420 appears to be a continuation of the 
physical table 172. Behind the virtual table 51 sit avatars 5 
representing users 17 at other environmental locations 273. The AVE 
background behind the avatars 5 includes a virtual meeting room with 
windows 60, door 58, walls 55 and ceiling 56. Each user wears a 
microphone 12. There are loudspeakers 173 for outputting the voices 
of the participants that are not at that location. As disclosed in 
the Fourth Embodiment, sound is mixed. 

Motion-tracking 

The benefit of tracked avatar animation is to provide additional 
visual cues for facial expressions, head movements and hand gestures 
which contribute to natural face-to-face communication. Cameras 29 
around the display device 264 capture the movements of participants. 
Images from the cameras 29 are processed in real-time to track facial 
animation, eye gaze, upper body movement and gestures. In a second 
step, the 2D tracked movements are mapped onto the 3D virtual 
environment and avatars of each person. In a third step, 
parameterised animation is generated. Video-based motion capture is 
used for non-invasive capture of face and body movements using a small 
number of cameras 29 surrounding the display screen 264. This motion 
capture augments the body and face movements of a participant's avatar 
animation where no motion capture input is available. A key 
innovation is the mapping of the captured movement to parameterised 
avatar motion models based on real movement to achieve realistic 
avatar animation that is robust to errors in the visual tracking. 
Parameters control motion characteristics such as movement speed and 
size. The emotional content of the original movement is conveyed 
whilst avoiding artefacts due to errors in tracking. Adaptive 
background subtraction is used to separate foreground objects (people) 
from the background scene and avoid the requirement for highly 
structured backgrounds (blue-screen) or constant scene illumination. 

Eye-Gaze Direction 

Eye-contact is an essential visual cue in face-to-face communication. 
To establish eye-contact between a virtual avatar and real 
participant, eye gaze direction of all participants is reconstructed. 
In a virtual meeting it is critical to establish which participant 
each person is looking at in near real-time. To achieve this, key 
facial features are tracked for each participant using a statistical 
template of facial appearance for each individual based on their 
avatar model. This is used to robustly identify the location of the 
eyes at each time instant. The use of a model-based vision approach 
allows the three dimensional location of these facial features to be 
reconstructed. A dynamic eye-template which models the appearance of
the eye with changes in viewing direction according to the iris 
location is then used to reconstruct the approximate viewing direction 
of the subject. Estimated gaze direction is used to identify if a 
participant is looking at the facial region of another real 
participant or avatar. Eye-contact is then established with the 
corresponding avatar. Avatar gaze-direction is animated to ensure 
correct eye-contact together with smooth transition of eye-contact 
between participants and with the background scene (i.e. when the
participant is not paying attention or is looking at other documents).
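
The final decision step, identifying whether a participant is looking
at the facial region of another participant or avatar, can be sketched
as follows. This is a minimal illustration only: it assumes the eye
position and gaze direction have already been reconstructed as 3D
vectors, and it models each facial region as a sphere. The function
name, the spherical regions and the coordinates are hypothetical and
are not taken from the disclosure.

```python
import math

def gaze_target(eye, direction, faces):
    """Return the name of the nearest facial region hit by the gaze ray.

    eye:       (x, y, z) position of the viewer's eyes
    direction: (x, y, z) estimated gaze direction (any non-zero length)
    faces:     dict of name -> ((x, y, z) centre, radius), an assumed model
    Returns None when the gaze falls on the background scene.
    """
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0:
        return None
    d = [c / norm for c in direction]
    best_name, best_along = None, None
    for name, (centre, radius) in faces.items():
        v = [c - e for c, e in zip(centre, eye)]
        along = sum(a * b for a, b in zip(v, d))    # distance along the ray
        if along <= 0:
            continue                                 # region is not in front
        closest = [e + along * c for e, c in zip(eye, d)]
        miss = math.dist(closest, centre)            # perpendicular miss distance
        if miss <= radius and (best_along is None or along < best_along):
            best_name, best_along = name, along
    return best_name
```

A result of None corresponds to the case in which the participant is
looking at the background scene rather than at another participant.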

Gesture Reconstruction

Established motion capture algorithms are used to reconstruct a 
subject's hand and head movement from the video streams. This approach 
utilises a real-time inverse kinematics engine to recover the 
approximate movement as estimates of joint angles. The reconstructed 
movement is mapped directly to the animated avatar using a dynamic
filter to constrain the movement, impose joint angle limits and
provide smooth animation. To achieve greater anima-realism, techniques
for mapping the captured noisy movement into parameterised gestures
are used. A database of parameterised realistic gestures is
established using conventional marker-based motion capture to
construct models of common gestures and explicitly parameterise the
intra-gesture variation. Statistical models based on learning from
visual data identify the gesture class and map the gesture to the
appropriate set of parameters. This model-based approach to gesture 
animation enables smooth and realistic gesture animation from noisy 
input data. 
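
The dynamic filter mentioned above is not specified further. As a
hedged sketch of what such a filter might do for a single joint, the
following update clamps the measured angle to the joint limits, applies
exponential smoothing and limits the change per frame. Exponential
smoothing is a stand-in chosen for illustration; the parameter names
and default values are assumptions.

```python
def filter_joint_angle(previous, measured, lower, upper,
                       smoothing=0.3, max_step=10.0):
    """One update of a simple joint-angle filter (angles in degrees).

    previous: filtered angle from the last frame
    measured: noisy angle estimated by the inverse kinematics engine
    lower, upper: joint angle limits
    smoothing: blend weight given to the new measurement (assumed value)
    max_step: largest allowed change per frame (assumed value)
    """
    target = min(max(measured, lower), upper)                  # joint limits
    blended = previous + smoothing * (target - previous)       # smoothing
    step = max(-max_step, min(max_step, blended - previous))   # constrain speed
    return previous + step
```

Calling the function once per frame and per joint, feeding the result
back in as `previous`, yields a smoothed trajectory that cannot leave
the permitted range.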

Facial Expression Recognition 

A key visual cue in face-to-face visual communication is the secondary
facial expression in conjunction with speech. A model-based
methodology is adopted based on a highly sophisticated facial
animation model. The facial animation model encodes parametric models
of facial expression that express both the extent of movement and the
temporal duration of the movement. Video analysis of facial expression
using particle filters identifies key facial features corresponding to
different facial expressions. Statistical models of facial expression
are learnt from labelled video sequences of multiple individuals. The
learnt statistical models are used to identify the class of facial 
expression or combination of expressions. Finally detailed analysis of 
facial features is applied to identify the spatial and temporal 
parameters for a particular expression. The captured facial expression 
parameters are then used to augment the avatar facial movement 
synchronised with speech. 

Multiple users 

Although three users 17 are shown at the motion-tracking terminal 265
in Figure 54, this invention is operable for one or more users at each
motion-tracking terminal 265. Depending on the detailed design of a
motion-tracking terminal 265, there are limits to the number of users
17 that can be motion tracked. One limitation comes when there are so
many users close together that the motion tracking system cannot
resolve which movements belong to which person. A second limitation
is that of the computing power of the motion tracking system to follow
a maximum number of users 17 simultaneously. A third limitation is
from the number of chairs that can be fitted around the table 172.
For large sessions, this motion-tracking terminal permits two or more
rows of chairs and for people to stand behind those sitting in the
chairs. However, in this case most participants will not be motion
tracked.

Setup

The input of users to the meeting consists of speech and motion. It 
is important that the captured speech and motion are attributed to the 
correct avatar on other user devices. 

Speech is identified automatically by means of linking the identity of 
each microphone 12 with the avatar number 8 of the user 17. A person 
working in an organisation could have his identity card and his 
wireless microphone linked together. His microphone could be used for 
all voice input applications in the organisation such as fixed 
telephone, mobile telephone, paging, PC interaction and avatar user 
interface sessions. The organisation's database would link the
person's identity, the microphone identity and the person's avatar
number 8. This would be made available to the radio transceiver 170.
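
The linkage described above amounts to a lookup from microphone
identity to avatar number. A minimal sketch follows, assuming a flat
record per person; the record layout, the field names and the
identifier formats are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PersonRecord:
    person_id: str       # identity card number (hypothetical format)
    microphone_id: str   # identity of the person's wireless microphone
    avatar_number: str   # avatar number 8 of the person

def index_by_microphone(records):
    """Build the lookup that would be made available to the transceiver."""
    return {r.microphone_id: r for r in records}

def attribute_speech(microphone_id, index):
    """Return the avatar number the speech belongs to, or None if unknown."""
    record = index.get(microphone_id)
    return record.avatar_number if record else None
```

Speech arriving from an unregistered microphone returns None, so it is
never attributed to the wrong avatar.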

There are several ways of identifying the motion of a tracked person
with the person's identity, i.e. locating the person in the room. A
low-technology way is a manual process using a seating plan. Chairs 174
are always in known positions and numbered: Chair 1, Chair 2 etc. In
a manual setup process at the start of the session, a user 17 at each 
location identifies the avatar number of the person in each chair by 
means of direct input into the avatar user interface system 261, 
normally using keyboard 14 or mouse 15. This manual setup process 
works but relies on fixed chair positions. 

A more flexible manual process is the interactive identification of
each user 17. The motion-tracking terminal 265 knows who is present
but not where they are located. In a simple procedure at the start of
the session, the software director 80 asks each user in turn to wave
both arms until the motion tracking system has located him. This
enables people to move chairs around to suit the number of people
present. One drawback of this method is that if people move around,
the system might lose them, e.g. if they leave the room to get
something and then return. A drawback with manual processes is that
identification can take some time if there are a lot of people present
and this wasted time costs money.



Ideally, the system should be automatic. There are several methods
of achieving this. A first method is that wireless microphones are
automatically tracked by triangulation of the signal between two or
more receivers to estimate the location of the person in the room.
These estimated locations are automatically mapped onto the
motion-tracking system output to identify each moving person
automatically.
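
The mapping of estimated microphone locations onto the motion-tracking
output can be sketched as a nearest-neighbour assignment. The greedy
strategy, the 2D floor coordinates and the distance threshold below are
assumptions made for illustration; the disclosure does not specify the
matching method.

```python
import math

def match_microphones_to_tracks(mic_positions, track_positions,
                                max_distance=1.0):
    """Greedily pair each tracked person with the nearest free microphone.

    mic_positions:   dict of microphone id -> (x, y) from triangulation
    track_positions: dict of track id -> (x, y) from motion tracking
    max_distance:    pairs further apart than this are rejected (assumed)
    Returns a dict of track id -> microphone id.
    """
    pairs = sorted(
        (math.dist(m_pos, t_pos), mic, track)
        for mic, m_pos in mic_positions.items()
        for track, t_pos in track_positions.items()
    )
    assigned, used_mics = {}, set()
    for distance, mic, track in pairs:
        if distance > max_distance:
            break                      # all remaining pairs are further apart
        if mic in used_mics or track in assigned:
            continue
        assigned[track] = mic
        used_mics.add(mic)
    return assigned
```

Combined with the database linking microphone identity to avatar
number, the result identifies which avatar each tracked movement
belongs to.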

In a second method, each microphone on the system has a visible, 
signal emitting light that is tracked by the cameras. The code of the 
signal emitting light is unique and associated with the identity of 
the person. The cameras map the light to the movement of the person 
to automatically identify the person. The advantage of tracking the
microphones is that the microphone will always be within a small
distance from the head/neck of a person.

None of these manual and automatic methods are perfect. Each has its 
own advantages and disadvantages. It is a purpose of this invention
that a setup means be provided, either manual or automatic, for 
locating the position of each participant in the room. Automatic 
methods are better than manual methods, since they are more robust to 
movement during the session. 

Terminal sizes 

The motion tracking terminal 265 might be designed in a range of
different sizes and to different price points. A large motion
tracking terminal 265 might use the whole wall of a room as the
display device 264. This might be achieved by the wall being a
special opaque screen for rear transmission and the projector in an
adjacent room projecting the AVE 420 onto the screen such that it is
visible to the users 17 in the meeting room. The width of the table
172 could be more than 5 metres; the shape could be elliptical on one
side and straight on the display side. Two rows of chairs 174 might
be provided. A large number of cameras 29 could be situated to track
a large number of participants 17 sitting in the chairs 174. Each
participant in the room could see each other participant. It might
have a maximum capacity of more than 20 motion tracked people.



A medium-size motion tracking terminal 265 might use two plasma
screens situated on the end of the table 172. It might have a maximum
capacity of 7 motion tracked people.

A smaller motion tracking terminal 265 might use one monitor on the
end of the table 172. It might have a maximum capacity of 3 motion
tracked people. A motion tracking terminal 265 could be installed at
each of the offices of an international organisation. 

User devices and combinations 

Many different types of user device may be used in an avatar user
interface system 261. For multiple users at one location, a
motion-tracking terminal 265 may be the optimal user device. For a user
in his office a PC 3 may be the best device; this PC 3 may or may not
have a webcam 29 to track the movements of the user 17. Whilst on the
move, a user may use a mobile device such as a wireless Personal
Digital Assistant (PDA) with telephone to participate in a
communication session. CAVEs 350, exercise stations 414 and VR
Headsets 366 are other types of user device that may be used in an
avatar user interface system 261.

Figure 55 is a block diagram of apparatus for an avatar user interface
system 261 with multiple user devices. A session server 1, an avatar
hosting server 4, an avatar agent hosting server 321, a
motion-tracking terminal 265, a CAVE 350 and a PC 3 are connected
together by a network 2.

At the lowest level of usage, an avatar user interface system 261 may 
be operable with a minimum of one user device and one user 17. In the 
case of one user 17, the user is probably communicating with an avatar 
intelligent agent software unit 320. 

The highest quality usage for the best sense of co-presence is when 
all the users 17 are using motion tracking terminals 265. 

This invention provides for the reality that users 17 may not all be
at locations where there are motion tracking terminals 265 available 
and provides for users being connected via a variety of different user 
devices to one session. 

SIXTEENTH EMBODIMENT 

It is a purpose of this sixteenth embodiment that the display device 
264 of the avatar user interface system 261 includes two or more 
projection means. 

AVE and Presentation resolution 

With a single display means such as a computer screen, if the virtual 
presentation screen 53 in the Avatar Virtual Environment (AVE) 420 is
small, then a presentation slide containing words that is projected 
onto the virtual presentation screen 53 will be unreadable. A typical 
computer screen will have 1024 pixels across and this might also be 
the width of a large meeting room media window 50 showing an AVE 420. 
If the virtual presentation screen 53 is in proportion with the whole 
virtual meeting room, then it may only have 200 pixels width. This is 
not enough pixels for resolving the words on a presentation slide. 

The human eye has great resolving power and a person may read a
poster on a wall, even if the poster is quite small and the person is
not close to it. From the same position, the person can also take in
the whole wall by 'zooming out'. A novel display apparatus in an
avatar user interface system 261 is disclosed, which takes advantage
of the capabilities of the human eye to view simultaneously the AVE
420 and the presentation screen 53 at full resolution as if they were
one environment.

Figure 56 is a schematic of a display device 264 consisting of a 
display screen 430, an AVE projector 431 and a Presentation projector 
432. The meeting room media window 50 is projected by the AVE 
projector 431. The virtual presentation screen 53 is projected by the
Presentation projector 432. 



To avoid 'whiting out' the virtual presentation screen 53, the same
area in the AVE is projected black with minimal light leaving the AVE
projector 431 to fall on the area of the presentation screen 53. In
this way, the presentation benefits from the full contrast of the
Presentation projector 432. Furthermore, the presentation appears
brighter than the AVE, which is a strong parallel to a real
presentation in a darkened real room, in which the presentation screen
is usually the brightest element. Projection may be from the
table top, from a ceiling attachment or in reverse from behind an
opaque screen. The software director 80 on the PC 3 will generate two
full-size displays: the AVE and the presentation; 3D graphics cards
already on the market can drive two full-size displays. The display
screen 430 may be any aspect ratio or it may be curved.
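
The blacking-out of the presentation area in the AVE image can be
sketched as a simple masking step applied to each AVE frame before it
is sent to the AVE projector. The frame representation (rows of RGB
tuples) and the function name are assumptions for illustration only.

```python
def black_out_region(frame, left, top, width, height):
    """Return a copy of the frame with one rectangle set to black.

    frame: list of rows, each row a list of (r, g, b) tuples
    left, top, width, height: rectangle of the virtual presentation
        screen in AVE pixel coordinates (assumed to be known)
    """
    masked = [list(row) for row in frame]
    for y in range(max(top, 0), min(top + height, len(masked))):
        row = masked[y]
        for x in range(max(left, 0), min(left + width, len(row))):
            row[x] = (0, 0, 0)      # minimal light leaves the AVE projector
    return masked
```

The rectangle is clipped to the frame, so a presentation area that
partly overhangs the AVE area is handled without error.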

Dual projector unit 

Figure 57 is a schematic of a display device 264 in which the AVE and
Presentation projection means are combined into one physical unit 433.
The AVE projection optics 434 has the normal controls available on a
desktop projector such as focus and perhaps zoom. The axis 439 of the
Presentation projection optics 435 may be altered such that it points
anywhere within the AVE area 440 projected by the AVE projection lens
434. A slider control 436 can be moved by a user 17 to move the axis
439 from left to right. A slider control 437 can be moved by a user
17 to move the axis 439 up and down. A slider control 438 can be
moved by a user 17 to zoom the Presentation area 441 in and out. The
controls 436-438 may directly move the presentation projection optics
435, or they may drive motors that move the optics. In this way, the
presentation area 441 can quickly be aligned to the right place in the
AVE area 440 at the start of the session. During the session, it is
important that, once set up, the software director 80 does not move
the pixel position of the virtual presentation screen 53 in the AVE
420. Manual control of the position and size of the axis may be
achieved by a number of other means such as the use of a remote
control. A camera 221 built into the projector 443 that images the
AVE area 440 could be used to locate the projected size/position of
the Presentation area 441. A control loop could be constructed to set
the presentation projector axis orientation/zoom automatically using
software-driven motors driving the presentation projection optics 435.
The control loop could be driven by the software director 80 from the
PC 3 which could project reference images from both projectors
alternately that are imaged by the camera 221. It is a further
purpose of this sixteenth embodiment that the projection means is
provided with alignment means that can be either manual or automatic
or both.
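
The automatic alignment could take many forms. One hedged sketch is a
proportional control loop that repeatedly measures the offset between
the wanted and the observed position of the presentation area and
nudges the axis by a fraction of that offset. The callables, the gain
and the tolerance are hypothetical; the measurement from camera images
and the motor interface are assumed to exist elsewhere.

```python
def align_presentation_axis(measure_offset, move_axis,
                            gain=0.5, tolerance=1.0, max_iterations=50):
    """Drive the presentation axis until the measured offset is small.

    measure_offset: callable returning (dx, dy), the wanted position minus
        the observed position in camera pixels (assumed interface)
    move_axis: callable taking (dx, dy) that moves the axis (assumed)
    Returns True if the offset fell within tolerance, otherwise False.
    """
    for _ in range(max_iterations):
        dx, dy = measure_offset()
        if abs(dx) <= tolerance and abs(dy) <= tolerance:
            return True
        move_axis(gain * dx, gain * dy)   # proportional correction
    return False
```

A gain below 1 trades speed for stability; with an ideal response the
remaining offset shrinks by the factor (1 - gain) on each iteration.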

Powerful projected conference systems 

Larger conference facilities may require two or more presentation 
screens within the overall AVE display. One presentation screen might 
show a video, a second a presentation slide and a third might show a 
head/shoulders shot of the avatar of the presenter. Each presentation
screen might be driven by a different projector. Or, a plurality of 
virtual presentation screens might be arranged in the AVE such that 
they can be driven by one presentation projector 432. In this case, 
the resolution of each virtual presentation screen is half or less. 

PCs are able to generate real-time 3D with more pixels than display 
projectors can project. Two or more AVE projectors 431 could be used 
in a tile formation to project a high-resolution AVE. Alignment means 
permit the projectors to be aligned to each other so that there are no 
gaps and no overlaps. The display screen 430 may be
planar-rectangular, or it may be curved, or it may comprise a number of
planes abutting at any angle. Different projectors might be located 
to project onto different planes or curves. 

It is a further purpose of this sixteenth embodiment that any number 
of AVE projectors 431 and any number of Presentation projectors 432, 
whether integrated in units 433 of two or more projectors or not, may 
be used to display any number of virtual presentation screens within 
an AVE on a continuous display screen of any shape or combination of 
shapes.

Multi-density display device

Display devices available today usually have a single screen that is 
either illuminated within its unit (such as CRT monitors, LCD 
displays, plasma screens, opto-polymer displays) or comprises a 
separate screen illuminated by projection from another unit (front 
projector, rear projector). The scope of this sixteenth embodiment is
not limited to projection devices, but includes single unit devices
with two or more areas of display of different pixel densities as
measured by pixel row and column spacings in units of length. 



Figure 58 is a schematic of a multi-density display device 451
comprising an area of low-density pixels 452 and an embedded area of
high-density pixels 453. The multi-density display device 451 may be
packaged in a single unit, which has the advantages of lower
complexity, lower weight, lower manufacturing and lower installation
costs for example. Or, it may be packaged as two or more units. The
embedded high-density area 453 may insert into the low-density area
452 such that the join cannot be seen when the multi-density device is
in use, or the join may be visible, but not in such a way that it
impairs the usability of the device. The high-density area 453 could
be situated anywhere in the low-density area 452. The high-density
area 453 could be central, surrounded on all sides by low-density
pixels 452, or it could be in an edge or at a corner or as a flap
along a whole edge.

The main advantage of a multi-density display device 451 over a
uniformly high-density device is that it will be lower cost to
manufacture and require less electronics to drive. Most multi-density
display devices 451 will only have double the number of pixels of a 
conventional display device, instead of possibly nine times for a 
typical application. 
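
The two figures quoted above can be reconciled by a short calculation,
offered only as one illustrative reading. If the high-density area has
three times the linear pixel density (nine times the pixels per unit
area) and occupies one eighth of the display, the device has exactly
twice the pixels of a uniformly low-density display. The factor 3 and
the fraction 1/8 are assumptions chosen to reproduce 'nine times' and
'double'; the source does not state them.

```python
from fractions import Fraction

def pixel_count_ratio(linear_factor, high_density_fraction):
    """Pixel count relative to a uniformly low-density display.

    linear_factor: high-density rows (and columns) per low-density row
    high_density_fraction: share of the display area that is high-density
    """
    f = Fraction(high_density_fraction)
    return (1 - f) + f * linear_factor ** 2

print(pixel_count_ratio(3, Fraction(1, 8)))  # multi-density device
print(pixel_count_ratio(3, 1))               # uniformly high-density device
```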

In general use, the multi-density display device 451 is operable such
that a single image, e.g. a photograph, can be displayed at uniformly
low resolution across the entire device. There are advantages of pixel
alignment such that the row and column density of the high-density
area 453 is an integer multiple of that of the low-density area 452.
If the integer factor is 2 then there will be two rows of pixels in the
high-density area for each row in the low-density area. The same
applies for columns. This is shown in the magnified part of Figure 58.
In this configuration, four high-density pixels 455 may be imaged to be
equivalent to a single low-density pixel 454. A similar
correspondence applies for other integer factors such as 3 or 4. It
is also contemplated in this embodiment that there may be a different
integer factor for columns than for rows and that there may be a real
factor such as 2.5 for rows and a different real factor such as 2.7
for columns. In the case of real factors, well-known image processing
techniques may be used such that the display of a single image is not
impaired. The display illumination intensity of a low-density pixel
454 may not be the same as the display illumination intensity of a
high-density pixel 455. After a process of calibration, the image
processing software will need to compensate for any difference in
display illumination intensity in applications such as the display of
a single photographic image.
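
For integer factors, the correspondence between low-density and
high-density pixels reduces to pixel replication. The following sketch
shows how the part of a low-resolution image that falls on the
high-density area could be expanded so that it appears at uniformly low
resolution. The function name and the list-of-rows image format are
assumptions; intensity calibration and real-valued factors are not
covered.

```python
def replicate_pixels(image, row_factor=2, col_factor=2):
    """Expand each pixel into a row_factor x col_factor block.

    image: list of rows of pixel values for the region of the source
        image that falls on the high-density area
    """
    out = []
    for row in image:
        wide = [pixel for pixel in row for _ in range(col_factor)]
        out.extend(list(wide) for _ in range(row_factor))
    return out
```

With both factors equal to 2, each source pixel drives four
high-density pixels, matching the correspondence described for
Figure 58. Different row and column factors are supported by passing
different values.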

In specialist use, such as in an avatar user interface system, the
multi-density display device 451 may display an Avatar Virtual
Environment 420 onto the low-density area 452 and a virtual
presentation screen 53 onto some or all of the high-density area 453.
In specialist use, the display illumination intensity of the
low-density area 452 may be different from the display illumination
intensity of the high-density area 453. In the case of displaying
small text on the high-density area 453, it will be easier to read if
it has a higher display illumination intensity.

Multi-density display devices 451 may be manufactured in a variety of
ways using a variety of technologies such as liquid crystal, plasma
and opto-polymers. Manufacturing processes will need to be developed
for the production of multi-density display devices and this is not
expected to be difficult for those skilled in the art.



It is a further purpose of this sixteenth embodiment that any number 
of low-density areas 452 and any number of high-density areas 453 may 
be combined in any way in a multi-density display device 451.

Dual-projector / multi-density display device use

Dual-projection devices 433 and multi-density display devices 451 are
useful in communication sessions involving both AVEs and detailed
information displays. A key advantage is the combination of sense of
presence and the ability to view detailed information such that the
user has a feeling of being there. A range of devices 431, 432, 433, 
451 may cover needs from one user in a small room to several thousands 
of users in a large conference room. 

It is a further purpose of this sixteenth embodiment to disclose a
process wherein a computing appliance means uses a display device
comprising two projector means comprising the following steps:
- a first projector projects an avatar virtual environment;
- a second projector projects a presentation;
such that both projections respond to changes independently and at the
frame rate being used.

SEVENTEENTH EMBODIMENT

It is a purpose of this seventeenth embodiment that the avatar user 
interface system 261 includes a directional microphone device 460. 

Remote Presentations 

As disclosed in this seventeenth embodiment, live presentations can be
delivered by a remote presenter to a room with an audience using an
avatar user interface system 261. Furthermore, live presentations can
be delivered to a mixed audience consisting of an audience physically
present in a room and a virtual audience simultaneously present at one
or more other locations, connected by a network. During a
presentation, the presenter's avatar can use media such as slide
images projected onto a virtual screen.

In an interactive session with the audience, there are several
problems. The first problem is that of gaze. It is normal for a
lecturer to address the person in the audience who asked the question.
But where is that person? In the second problem, that of mixed
audiences, if the questioner is not in the same room as a viewer, then
it will be beneficial for the viewer to see a virtual audience. A
third problem is the likelihood that not everyone in a large audience
has an identifiable avatar and a personal microphone.

Figure 59 is a schematic of an avatar user interface system 261 with a 
mixed audience of avatars 5 of virtual users at various locations and 
physical users 17 in an environmental location 273 which is a room
containing the physical audience and a directional microphone device
460 that can record not only sound but also the direction from which
the sound is coming. The directional microphone device 460 is
connected to a 'Room PC' 3 that is also connected to the room's
display device 264 and a network 2. An avatar 5 labelled 'Virtual
Presenter' represents a remote user 17 labelled 'Remote Presenter'.
The remote presenter 17 is using a 'Presenter PC' 3 on the network 2.
A physical user 17 labelled 'Questioner' asks a question. The voice
270 and its direction are picked up by the directional microphone
device 460 that feeds the information to the PC 3. The software
director 80 controls the gaze direction of the virtual presenter 5 to
face the questioner 17 as the presenter 17 replies. The accuracy of
the gaze direction of the virtual presenter 5 towards the questioner
17 can be improved by building a virtual model of the environmental
location 273 including the positions of the display device 264 and the
directional microphone device 460.

In large conference rooms, there are often a number of fixed
microphones for use by the audience. The accuracy of the gaze
direction can be further improved by (a) using the directional
microphone device to identify which fixed microphone is being used and
(b) using the known location of the fixed microphone in the virtual
model of the environmental location to determine the gaze direction.
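
Steps (a) and (b) can be sketched in two dimensions as follows. The
sketch assumes a floor-plan model of the room in which the positions of
the directional microphone device, the display device and the fixed
microphones are known, and that the measured direction is a bearing in
the same coordinate frame. All names and coordinates are hypothetical.

```python
import math

def bearing(origin, point):
    """Angle in degrees of point as seen from origin, in the floor plane."""
    return math.degrees(math.atan2(point[1] - origin[1],
                                   point[0] - origin[0]))

def angular_difference(a, b):
    """Smallest absolute difference between two angles in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)

def locate_speaker(measured_bearing, device_position, fixed_microphones):
    """Step (a): pick the fixed microphone closest in bearing.

    fixed_microphones: dict of name -> (x, y) in the room model
    Returns (name, position).
    """
    return min(
        fixed_microphones.items(),
        key=lambda item: angular_difference(
            measured_bearing, bearing(device_position, item[1])),
    )

def gaze_yaw(display_position, speaker_position):
    """Step (b): yaw the virtual presenter should face, seen from the display."""
    return bearing(display_position, speaker_position)
```

Using the known microphone position rather than the raw bearing means
that the gaze direction is computed from the display device, not from
the directional microphone device, which may stand elsewhere in the
room.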

A 'Remote Questioner' 17 is visualised at the environmental location 
273 as a 'Virtual Remote Questioner' 5 displayed on the display device 
264. When the presenter responds to the remote questioner, the 
software director knows the positions of both the virtual presenter
avatar 5 and the virtual remote questioner avatar 5 and can calculate 
the gaze direction. The physical members of the audience at the 
environmental location 273 see the remote presenter answering the 
remote questioner. 

This embodiment is applicable to multiple remote presenters such as a 
presenter and a chairman or a panel of presenters. One or more of the 
presenters may be at the same environmental location 273. Any number 
of environmental locations 273 with two or more users 17 and any 
number of environmental locations 273 with one user 17 may be
connected by a network 2 during a presentation. This embodiment is
also applicable to the simple case of one remote presenter presenting
to one physical audience, in which case there is no virtual remote
audience.



It is a purpose of this embodiment to provide means for a remote 
presentation using the avatar user interface system. 

Prepared presentation 

In a live remote presentation, the software director 80 has to 
determine movements for the virtual presenter avatar 5 in real-time. 
Many body and facial gestures are normally timed by skilled presenters 
to fit in with the beginning and end of sentences. This is not 
possible in real-time for the software director 80 because it does not 
know when a sentence is due to begin or end. 

A remote presenter may pre-record his presentation using a microphone 
to record the words as he speaks them. The software director 80 can 
then be used to prepare a better visual avatar presentation than the 
live presentation. This preparation can be done automatically by the 
software director 80 or interactively with the presenter 17. 

Figure 60 is a block diagram of an apparatus for presentation 
preparation. A presentation preparer 461 may be operated either 
automatically or interactively by a user 17 to output a prepared 
presentation 466. At any time later, the prepared presentation 466 
may be played on a player 210. The presentation preparer 461 has a 
set of voice recordings 464 and any associated media elements 465 as 
the main input. Media elements might be slide images, animations, 
audio-video clips, 3D objects, avatar player scenes or any other type
of media. A prepared presentation 466 is an example of an avatar 
player scene; it may be executed in a linear fashion by a player 210. 
A presentation may be prepared without media elements 465. A 
presentation might also be mimed without voice recordings 464. 

In automatic preparation, the software director 80 takes a series of
voice recordings 464 that have been associated with presentation
media elements 465 such as slide changes and automatically generates
the complete presentation including but not limited to: movement,
gestures, gaze and lipsync for avatars; lighting, prop and camera
animations. A library of presentation actions 462 and a presentation
action generator 463 are used for preparing the avatar animation. A
set of automatic presentation rules is built into the presentation
preparer 461 which is a finite state machine.

In manual presentation preparation using the presentation preparer 461,
the user 17 may select what animations should be used when. Manual
preparation is based on manually editing event positions on a
timeline.
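
Editing event positions on a timeline can be pictured with a small
data structure that keeps animation events sorted by time. This is a
sketch under assumptions: the class names, the use of seconds and the
action labels are hypothetical, and the format of a prepared
presentation 466 is not described in this section.

```python
import bisect
from dataclasses import dataclass, field

@dataclass(order=True)
class TimelineEvent:
    time_s: float                        # position on the timeline, seconds
    action: str = field(compare=False)   # label of the animation to play

class Timeline:
    def __init__(self):
        self.events = []

    def add(self, time_s, action):
        """Insert an event, keeping the list ordered by time."""
        bisect.insort(self.events, TimelineEvent(time_s, action))

    def move(self, action, new_time_s):
        """Move the first event with this action; return False if absent."""
        for i, event in enumerate(self.events):
            if event.action == action:
                del self.events[i]
                self.add(new_time_s, action)
                return True
        return False

    def actions(self):
        return [event.action for event in self.events]
```

Because events are ordered only by time, moving an event is a delete
followed by a sorted insert, and playback is a simple left-to-right
walk of the list.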

There are several advantages of a prepared presentation: (a) nervous 
presenters can fully prepare their presentations with much less 
stress; (b) scope is removed for the presenter to poorly time his 
presentation and overrun the time slot; (c) the gestural quality and 
timing of the prepared presentation is higher; (d) unskilled
presenters with poor body language need not be embarrassed. 

It is a purpose of this embodiment to provide means for preparing a 
remote presentation using the avatar user interface system. 

Presentation control 

During the remote presentation, either a user 17 may control the mode 
of the software director 80 using mode selection buttons in the avatar 
user interface window 260, or the software director 80 may make a best 
guess at the mode. The rules applied to controlling the movement of
the avatar of the presenter vary with mode. Modes include:
- Playing a prepared presentation
- Live presentation
- Question
- Answer
- Asking for questions
- Applause
- Background murmur between presentations
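
The modes listed above map naturally onto an enumeration. The sketch
below also shows what a 'best guess' might look like when no mode
button has been pressed. The guessing rules are entirely hypothetical,
based only on who is currently speaking; the disclosure does not state
how the software director 80 makes its guess.

```python
from enum import Enum

class Mode(Enum):
    PREPARED = "Playing a prepared presentation"
    LIVE = "Live presentation"
    QUESTION = "Question"
    ANSWER = "Answer"
    ASKING = "Asking for questions"
    APPLAUSE = "Applause"
    MURMUR = "Background murmur between presentations"

def guess_mode(presenter_speaking, audience_speaking, previous):
    """Guess the current mode from voice activity (illustrative rules)."""
    if audience_speaking and not presenter_speaking:
        return Mode.QUESTION
    if presenter_speaking and previous in (Mode.QUESTION, Mode.ANSWER):
        return Mode.ANSWER
    if presenter_speaking:
        return Mode.LIVE
    return previous                      # no evidence: keep the current mode
```

A mode selected by a user 17 in the avatar user interface window 260
would simply override the value returned by such a function.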

It is a purpose of this seventeenth embodiment to disclose a process
wherein directional microphone means and seating plan means are used,
comprising the following steps:

- a person speaks;
- a directional microphone means records the person's speech and the
  direction the speech is coming from;
- a software director uses the seating plan and the direction that
  the speech is coming from to generate avatar enactments such that
  displayed avatars can gaze in the direction of the speaker.
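
The steps above can be illustrated with a minimal Python sketch. It
assumes a hypothetical seating plan that records, for each seat, a
bearing in degrees as seen from the directional microphone, and it
takes the speaker to be the occupant of the seat closest to the
measured direction. The names Seat, locate_speaker and gaze_targets
are illustrative only and do not appear in this disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seat:
    occupant: str        # hypothetical attendee name
    bearing_deg: float   # assumed bearing of the seat from the microphone


def angular_distance(a, b):
    """Smallest absolute difference between two bearings, in degrees."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def locate_speaker(seating_plan, speech_bearing_deg):
    """Return the seat whose bearing is closest to the measured direction."""
    return min(
        seating_plan,
        key=lambda seat: angular_distance(seat.bearing_deg, speech_bearing_deg),
    )


def gaze_targets(seating_plan, speech_bearing_deg):
    """Map each non-speaking attendee to the person their avatar gazes at."""
    speaker = locate_speaker(seating_plan, speech_bearing_deg)
    return {
        seat.occupant: speaker.occupant
        for seat in seating_plan
        if seat is not speaker
    }
```

In this sketch the software director would pass the result of
gaze_targets to whatever component turns gaze targets into avatar
enactments; that component is outside the scope of the example.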

FURTHER MODIFICATIONS AND AMENDMENTS 

Although the previous embodiments of the present invention have been 
described in which the personal computer 3 has been used for running 
the avatar user interface 160, it will be appreciated that a wide 
range of computing appliances 3 could be used. It will also be 
appreciated that any network may be used including the internet, a 
corporate intranet, an extranet, a virtual private network, wireless 
networks such as 3G, GSM, home wireless networks and direct 
connections such as ISDN or PSTN telephone. It will further be
appreciated that when a user is referred to in this disclosure in the
male form such as 'he', this is an inconvenience of language, that
the meaning is equally applicable to male and female users, and that
the use of this invention is not limited to males but may be used in
an identical way by females.

1. An apparatus for an avatar user interface system comprising:

- server means for serving the communication session;
- one or more computing appliance means;
- network means for joining said server means and said computing
  appliance means;
- avatar means for representing each user visually; and
- avatar user interface application means resident on each computing
  appliance means;

operable by one or more users.

2. Apparatus in accordance with claim 1 wherein said avatar means
comprises an identity.

3. Apparatus in accordance with claim 2 wherein said identity means
comprises an avatar number.

4. Apparatus in accordance with claim 3 wherein said avatar number
means comprises an avatar hosting service identity number and an
avatar identity number.

5. Apparatus in accordance with any of claims 2 to 4 wherein said
identity means comprises a password.

6. Apparatus in accordance with any of claims 2 to 5 wherein said
identity means comprises a display permission.

7. Apparatus in accordance with any of claims 2 to 6 wherein said
identity means comprises biometric data.

8. Apparatus in accordance with any of claims 2 to 7 wherein said
identity means comprises impersonation parameters.

9. Apparatus in accordance with claim 8 wherein said impersonation
parameters means comprise action impersonation parameters.



10. Apparatus in accordance with claim 9 wherein said action
impersonation parameters means is generated using an action
impersonation parameter generator.

11. Apparatus in accordance with claim 8 wherein said impersonation
parameters means comprise voice impersonation parameters.

12. Apparatus in accordance with any of claims 2 to 11 wherein said
identity means comprises personal data.

13. Apparatus in accordance with any of claims 2 to 12 wherein said
identity means comprises billing data.

14. Apparatus in accordance with any preceding claim wherein said
avatar means comprises a 3D avatar.

15. Apparatus in accordance with claim 14 wherein said 3D avatar
means comprises a parameter avatar.

16. Apparatus in accordance with claim 14 wherein said 3D avatar
means comprises a photo-realistic avatar.

17. Apparatus in accordance with any of claims 1 to 13 wherein said
avatar means comprises an animatable image avatar.

18. Apparatus in accordance with any of claims 1 to 13 wherein said
avatar means comprises another avatar type.

19. Apparatus in accordance with any preceding claim wherein said
avatar means is generated using an avatar generator editor.

20. Apparatus in accordance with claim 19 wherein said avatar
generator editor means comprises a parameter avatar generator and a
database for parameter avatars.

21. Apparatus in accordance with claim 19 wherein said avatar
generator editor means comprises a photo-realistic avatar generator.

22. Apparatus in accordance with claim 21 wherein said
photo-realistic avatar generator means comprises a booth.

23. Apparatus in accordance with claim 21 wherein said
photo-realistic avatar generator means comprises a camera.

24. Apparatus in accordance with claim 21 wherein said
photo-realistic avatar generator means comprises a service.

25. Apparatus in accordance with any preceding claim wherein said
network means comprises an IP network.

26. Apparatus in accordance with any preceding claim wherein said
network means comprises a plurality of networks.

27. Apparatus in accordance with claim 26 wherein said plurality of
networks comprises at least one IP network and at least one telephone
network.

28. Apparatus in accordance with claim 26 wherein said plurality of
networks comprises at least one IP network, at least one telephone
network and at least one mobile phone network.

29. Apparatus in accordance with any preceding claim wherein said
avatar means is hosted by an avatar hosting server means.

30. Apparatus in accordance with claim 29 wherein said avatar
hosting server means comprises memory means for storing avatars.

31. Apparatus in accordance with claim 29 or 30 wherein said
avatar hosting server means comprises database means.

32. Apparatus in accordance with any of claims 29 to 31 wherein said
avatar hosting server means comprises avatar hosting server management
software means.

33. Apparatus in accordance with any of claims 29 to 32 wherein said
avatar hosting server means comprises one or more avatar converter
software means for converting avatar means from one format to another.

34. Apparatus in accordance with any of claims 4, 29 to 33 wherein
said avatar hosting server means is identified by said avatar hosting
service identity number and an avatar hosting registry server
connected to said network stores location information as to the
network location of said avatar hosting server means indexed to said
avatar hosting service identity number operable such that the network
location of said avatar hosting server means may be retrieved from
said avatar hosting registry server means by provision of said avatar
hosting service identity number.

35. Apparatus in accordance with any preceding claim wherein said
server means for serving the communication session comprises session
management software.

36. Apparatus in accordance with any preceding claim wherein said
server means for serving the communication session comprises an event
accumulator.

37. Apparatus in accordance with any preceding claim wherein said
server means for serving the communication session comprises an audio
mixer.

38. Apparatus in accordance with any preceding claim wherein said
server means for serving the communication session comprises a text
chat engine.

39. Apparatus in accordance with any preceding claim wherein said
server means for serving the communication session comprises an e-mail
engine.

40. Apparatus in accordance with any preceding claim wherein said
server means for serving the communication session comprises a speech
recognition engine.

41. Apparatus in accordance with any preceding claim wherein said
server means for serving the communication session comprises a
translation engine.

42. Apparatus in accordance with any preceding claim wherein said 
server means for serving the communication session comprises a text to 
speech engine. 

43. Apparatus in accordance with any preceding claim wherein said
server means for serving the communication session comprises a
protocol converter.

44. Apparatus in accordance with any preceding claim further
comprising an avatar agent hosting server.

45. Apparatus in accordance with claim 44 wherein said avatar agent
hosting server comprises avatar agent hosting server management
software.

46. Apparatus in accordance with claim 44 wherein said avatar agent 
hosting server comprises at least one intelligent agent software unit. 

47. Apparatus in accordance with claim 46 wherein said intelligent 
agent software unit comprises artificial intelligence software and a 
knowledge base. 

48. Apparatus in accordance with claim 46 wherein said intelligent
agent software unit comprises avatar text to speech software and said
voice impersonation parameters.

49. Apparatus in accordance with any of claims 44 to 48 wherein 
there is one computing appliance used by one user and one intelligent 
agent software unit hosted by one intelligent agent hosting server. 

50. Apparatus in accordance with any preceding claim wherein said
computing appliance means comprises display device means.

51. Apparatus in accordance with claim 50 wherein said display
device means comprise two projector means projecting an avatar virtual
environment projection and a presentation projection such that the
presentation projection is significantly smaller than and lies within
the boundary of the avatar virtual environment projection.

52. Apparatus in accordance with claim 51 wherein said two
projector means are provided in one physical unit.

53. Apparatus in accordance with claim 50 wherein said display
device means comprises a multi-density display device with a
high-density area set into a low-density display.

54. Apparatus in accordance with any preceding claim wherein said
computing appliance means comprises lip sync generation means.

55. Apparatus in accordance with any preceding claim wherein said
computing appliance means comprises a headset comprising both speaker
and microphone.

56. Apparatus in accordance with any preceding claim wherein said
computing appliance means comprises at least one radio transceiver and
an earpiece with microphone and speaker worn by a user for wireless
conversation.

57. Apparatus in accordance with any preceding claim wherein said 
computing appliance means comprises at least one directional 
microphone such that it is possible to identify the direction from 
which the voice of the speaker is coming. 

58. Apparatus in accordance with any preceding claim wherein said 
computing appliance means comprises an identity source reader. 

59. Apparatus in accordance with any preceding claim wherein said 
computing appliance means comprises a biometric device. 

60. Apparatus in accordance with any preceding claim wherein said
computing appliance means comprises game interface equipment.

61. Apparatus in accordance with any preceding claim wherein said
computing appliance means comprises a motion tracking terminal.

62. Apparatus in accordance with any preceding claim wherein said
computing appliance means comprises an exercise station.

63. Apparatus in accordance with any preceding claim wherein said
computing appliance means comprises a Cave.

64. Apparatus in accordance with claim 63 wherein said Cave means
comprises a motion capture system.

65. Apparatus in accordance with any preceding claim wherein said 
avatar user interface application means comprises an avatar user 
interface window displayed on said display device means. 

66. Apparatus in accordance with claim 65 wherein said avatar user 
interface window means comprises a session user interface window. 

67. Apparatus in accordance with claim 66 wherein said session user 
interface window means comprises a meeting room media window 
controlled by a software director. 

68. Apparatus in accordance with claim 65 wherein said avatar user 
interface window means comprises attendees functionality. 

69. Apparatus in accordance with claim 65 wherein said avatar user
interface window means comprises switchboard functionality.

70. Apparatus in accordance with claim 65 wherein said avatar user
interface window means comprises exhibitor functionality.

71. Apparatus in accordance with claim 65 wherein said avatar user
interface window means comprises identity functionality.

72. Apparatus in accordance with any preceding claim wherein said
avatar user interface application means comprises an avatar virtual
environment displayed on said display device means.

73. Apparatus in accordance with claim 72 wherein said avatar
virtual environment means comprises a virtual computing appliance.

74. Apparatus in accordance with any preceding claim further
comprising a game hosting server.

75. Apparatus in accordance with any preceding claim further
comprising a prepared presentation prepared using presentation
preparer means.

76. Apparatus in accordance with claim 1 wherein there are the same
number of said computing appliances and said users; each said
computing appliance is used by one said user; each said computing
appliance is at a unique physical location such that the display on
said computing appliance would normally only be clearly visible to its
user and not to any other user; and there is a minimum of two
computing appliances and two users.

77. Apparatus in accordance with claim 1 wherein there are fewer said
computing appliances than said users and at least one said computing
appliance is shared by a plurality of said users in the same physical
location as said computing appliance.

78. Apparatus in accordance with any preceding claim wherein no user
views his own avatar on the computing appliance he is using.

79. Apparatus in accordance with claim 1 wherein said server means
for serving the communication session is the same physical unit as one
said computing appliance means such that said physical unit contains
the functions of both said server means and said computing appliance
means.

80. A method of communication between a plurality of users via an
avatar user interface system comprising the steps of:

- joining a plurality of computing appliance means and a server means
  for serving the communications session to start a communication
  session by means of a network;
- viewing the avatars of the users involved in the communication
  session on the said plurality of computing appliance means;
- a user first communicating into a computing appliance;
- one or more users receiving the first communication on one or more
  other computing appliances;
- avatars enacting the first communication on said computing
  appliances;
- a user responding to the first communication in a second
  communication;
- one or more users receiving the second communication on one or more
  other computing appliances;
- avatars enacting the second communication on said computing
  appliances;
- continuing the exchange of communications until the session is
  finished; and
- terminating the joining of the computing appliance means and the
  server means for serving the communications session to terminate
  the communication session.

81. A method of communicating between at least one user and at least
one avatar agent via an avatar user interface system comprising the
steps of:

- joining one or more computing appliance means, an avatar agent
  hosting server means hosting one or more intelligent agent software
  units and a server means for serving the communications session to
  start a communication session by means of a network;
- viewing the avatars of the said avatar agents and said users
  involved in the communication session on the said computing
  appliance means;
- a user or an avatar agent first communicating;
- if there are one or more users who did not first communicate, then
  the one or more users who did not first communicate receive the
  first communication on one or more other computing appliances;
- avatars enacting the first communication on said computing
  appliances;
- if there are one or more avatar agents who did not first
  communicate, then the one or more avatar agents who did not first
  communicate receive the first communication;
- a user or an avatar agent responding to the first communication in
  a second communication;
- one or more users or one or more avatar agents receiving the second
  communication;
- if there are one or more avatars receiving the second
  communication, then avatars enact the second communication on said
  computing appliances;
- continuing the exchange of communications until the session is
  finished; and
- terminating the joining of the computing appliance means, the
  avatar agent hosting server means and the server means for serving
  the communications session to terminate the communication session.

82. A method in accordance with claim 80 or 81 wherein in said
viewing step each computing appliance means is viewed by one user.

83. A method in accordance with claim 80 or 81 wherein in said
viewing step at least one said computing appliance is viewed by a
plurality of said users in the same physical location as said
computing appliance.

84. A method in accordance with claim 80 or 81 wherein in said
viewing step each user cannot view his own avatar on the computing
appliance he is using.

85. A method in accordance with claim 80 or 81 wherein said
steps of communicating and receiving a communication take place in
parallel with a small time delay that is acceptable to the user.

86. A method in accordance with claim 80 or 81 wherein said
step of enacting a communication comprises both audio and visual
output such that movements of said avatar observed visually including
lip movements in a speaking avatar are synchronised with the audio
voice.

87. A method in accordance with claim 80 or 81 wherein said
joining step further comprises means for identifying a user with an
avatar.

88. A method in accordance with claim 87 wherein said viewing step
further comprises means for receiving an avatar of a user at said
computing appliance.

89. A method in accordance with claim 88 wherein said avatar is
received from an avatar hosting service.

90. A method in accordance with claim 89 wherein said avatar is
first converted by avatar converter software at said avatar hosting
service from one format to another.

91. A method in accordance with any of claims 80 to 90 wherein said
joining step further comprises the following steps:

- user providing an avatar number and password;
- said computing appliance sends said avatar number and said password
  to the network location of an avatar hosting service;
- avatar hosting server management software on said avatar hosting
  service checks a database to verify that said avatar number and
  said password are valid;
- if said avatar number and said password are valid, then avatar
  hosting server management software on said avatar hosting service
  sends said avatar to said computing appliance.

92. A method in accordance with any of claims 80 to 90 wherein, in
said joining step, an avatar number comprises an avatar hosting
service identity number and an avatar identity number, and said
joining step further comprises the following steps:

- user providing an avatar number and password;
- said computing appliance sends an avatar hosting service identity
  number to an avatar hosting registry server;
- said avatar hosting registry server sends to said computing
  appliance the network location of the avatar hosting service
  corresponding to said avatar hosting service identity number;
- said computing appliance sends said avatar number and said password
  to the network location of said avatar hosting service;
- avatar hosting server management software on said avatar hosting
  service checks a database to verify that said avatar number and
  said password are valid;
- if said avatar number and said password are valid, then avatar
  hosting server management software on said avatar hosting service
  sends said avatar to said computing appliance.
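
By way of illustration only, the joining steps above may be sketched
in Python with in-memory dictionaries standing in for the avatar
hosting registry server and the avatar hosting service database. The
layout of the avatar number as "service:avatar", the example network
location and all function names are assumptions made for this sketch
and are not specified in this disclosure; passwords are held in plain
text only to keep the example short.

```python
# Hypothetical registry: hosting service identity number -> network location.
REGISTRY = {"0042": "avatars.example.net"}

# Hypothetical hosting services keyed by network location. Each maps an
# avatar number to a (password, avatar data) pair.
HOSTING_SERVICES = {
    "avatars.example.net": {"0042:000123": ("s3cret", "<avatar 000123>")},
}


def split_avatar_number(avatar_number):
    """Split an avatar number into (hosting service id, avatar id).

    The 'service:avatar' layout is an assumption made for illustration.
    """
    service_id, avatar_id = avatar_number.split(":", 1)
    return service_id, avatar_id


def join_session(avatar_number, password):
    """Return the avatar if the number and password are valid, else None."""
    service_id, _ = split_avatar_number(avatar_number)

    # The registry returns the network location of the hosting service.
    location = REGISTRY.get(service_id)
    if location is None:
        return None

    # The hosting service checks its database for the number and password.
    record = HOSTING_SERVICES.get(location, {}).get(avatar_number)
    if record is None or record[0] != password:
        return None

    # Valid credentials: the avatar is sent to the computing appliance.
    return record[1]
```

The direct form of the joining step, without a registry, corresponds
to skipping the REGISTRY lookup and using a known network location.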

93. A method in accordance with claim 80 wherein in said enacting
step a viewing user at a computing appliance sees an avatar virtual
environment with avatars of other users at other computing appliances
that are photo-realistic and that move anima-realistically, which
substantially gives said viewing user the impression of the other
users being together in one virtual location.

94. A method in accordance with claim 81 wherein in said enacting
step a viewing user at a computing appliance sees an avatar virtual
environment, with any avatars of other users at other computing
appliances and any avatars of avatar agents, that are photo-realistic
and that move anima-realistically, which substantially gives said
viewing user the impression of any other users and any avatar agents
being together in one virtual location.

95. A method in accordance with any of claims 80 to 91 wherein, in
said enacting steps, software director means drive avatar engine means
to generate said enactment.

96. A method in accordance with claim 87 wherein, in said joining
step, said avatar means comprises an identity that further comprises a
display permission, and said joining step further comprises means for
checking that said display permission permits display of said avatar
on computing appliance means.

97. A method in accordance with claim 80 or 81 wherein, in said
joining step, said avatar means comprises an identity and said joining
step further comprises the following steps:

- a person providing an identity source that is read by an
  identity source reader;
- retrieving the avatar whose identity matches the identity in said
  identity source;
- displaying said avatar;
- a security user visually comparing said avatar with said person.

98. A method in accordance with claim 80 or 81 wherein, in said
joining step, said avatar means comprises an identity that further
comprises avatar biometric data, and said joining step further
comprises the following steps:

- a person providing an identity source that is read by an
  identity source reader;
- retrieving the avatar whose identity matches the identity in said
  identity source;
- extracting said avatar biometric data from said avatar;
- a biometric device scanning part of said person to provide scanned
  biometric data;
- comparing said scanned biometric data with said avatar biometric
  data;
- if said scanned biometric data does not match said avatar biometric
  data then alerting the security user;
- displaying said avatar to said alerted security user;
- said alerted security user visually comparing said avatar with said
  person.

30 

99. A method in accordance with claim 80 or 81 wherein sound
passes through the avatar user interface system comprising the
following steps:

- a microphone means records sound from a user of a computing
  appliance means as said user speaks;
- a lip synchronisation generator means on said computing appliance
  means processes said sound to provide a combined audio and
  geometric position stream;
- the computing appliance means streams said combined audio and
  geometric position stream over the network to an audio mixer;
- said audio mixer mixes said combined audio and geometric position
  stream with any other combined audio and geometric position streams
  to produce a specific mixed audio and geometric position stream for
  each computing appliance;
- said audio mixer sends each computing appliance its specific mixed
  audio and geometric position stream;
- said computing appliance plays said specific mixed audio and
  geometric position stream to its user via a loudspeaker means.
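
The mixing step above does not state how each specific mix is formed.
The following Python sketch shows one possible reading, under the
assumption that each computing appliance receives the sum of the audio
of all other appliances (so that a user does not hear his own voice
back) together with the geometric positions of the other speakers. The
frame layout and the function name are hypothetical.

```python
def mix_for_each_appliance(streams):
    """Produce one specific mix per appliance for a single time frame.

    streams: {appliance_id: (samples, mouth_position)}
    Returns {appliance_id: (mixed_samples, positions_of_others)}.
    Assumes every frame holds the same number of float samples.
    """
    mixes = {}
    for listener in streams:
        # Every stream except the listener's own contributes to its mix.
        others = {k: v for k, v in streams.items() if k != listener}
        length = len(streams[listener][0])
        mixed = [0.0] * length
        for samples, _ in others.values():
            for i, value in enumerate(samples):
                mixed[i] += value
        # Clamp to the nominal [-1.0, 1.0] range before playback.
        mixed = [max(-1.0, min(1.0, v)) for v in mixed]
        positions = {k: pos for k, (_, pos) in others.items()}
        mixes[listener] = (mixed, positions)
    return mixes
```

Keeping the geometric positions keyed by appliance, rather than mixing
them, lets the receiving appliance drive the lips of each displayed
avatar separately; this too is a design assumption of the sketch.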

100. A method in accordance with claim 99 wherein said lip
synchronisation generator process comprises a process performed at
regular intervals on a digital audio stream flowing into a buffer of
the following steps:

- the contents of the buffer are copied and then the buffer is
  emptied;
- a discrete Fourier transform is performed on the copied contents of
  the buffer and a spectrum is output;
- one or more analysers analyse the output spectrum and each analyser
  outputs a value representing a geometric position of a part of a
  talking head.
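
The three steps above may be sketched as follows, using only the
Python standard library. The discrete Fourier transform is written
out naively for clarity. The single analyser shown, which maps energy
in an assumed 300 Hz to 3000 Hz band to a mouth-opening value between
0 and 1, is hypothetical: the band limits, the gain and the function
names are illustrative and are not taken from this disclosure.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Naive discrete Fourier transform; magnitudes of bins 0..N/2."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        spectrum.append(abs(acc) / n)
    return spectrum


def mouth_open_analyser(spectrum, sample_rate, n,
                        low_hz=300.0, high_hz=3000.0, gain=4.0):
    """Hypothetical analyser: band energy -> mouth opening in [0, 1]."""
    bin_hz = sample_rate / n
    band = [m for k, m in enumerate(spectrum)
            if low_hz <= k * bin_hz <= high_hz]
    energy = math.sqrt(sum(m * m for m in band))
    return min(1.0, gain * energy)


def process_buffer(buffer, sample_rate, analysers):
    """Copy and empty the buffer, transform the copy, run each analyser.

    Returns the copied audio and one geometric position per analyser.
    """
    copied = list(buffer)
    buffer.clear()
    if not copied:
        return copied, [0.0 for _ in analysers]
    spectrum = dft_magnitudes(copied)
    positions = [a(spectrum, sample_rate, len(copied)) for a in analysers]
    return copied, positions
```

A production implementation would normally use a fast Fourier
transform; the naive form is used here only so that the example has no
external dependencies.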

101. A method in accordance with claim 100 wherein the sequence of
audio spectra is combined with the sequence of geometric positions
for transmission over the network.

102. A method in accordance with claim 101 wherein there are
compression and decompression steps.

103. A method in accordance with claim 95 wherein said software
director uses personal action impersonation parameters defined for an
avatar to generate animations for said avatar such that said avatar
moves recognisably like the person it represents.



104. A method in accordance with claim 95 wherein said software
director uses generic action impersonation parameters defined for a
communication context such that avatars move in ways believable within
that communication context.

105. A method in accordance with claim 104 wherein said generic
action impersonation parameters are defined for said communication
context comprising the following steps:

- recording a corpus of videos of said communication context;
- processing said corpus by a trained person along a timeline to
  produce an annotated timeline with actions of each communication
  context participant related to a number of parameters;
- analysing said annotated timeline by a trained person to produce a
  type definition of each action impersonation parameter and a set of
  rules that can be incorporated into a finite state machine for said
  communication context.

106. A method in accordance with claim 103 wherein said personal
action impersonation parameters for a particular person are generated
using an action impersonation generator/editor means involving manual
input by a user comprising the following steps:

- in the first step, said user makes selections from a number of sets
  of generic action impersonation parameters at a high level;
- in the second step, said user edits said selections at a lower
  level;

wherein said second step is optional and said user may or may not be
the person for whom the personal action impersonation parameters are
generated.

107. A method in accordance with claim 103 wherein said personal
action impersonation parameters for a particular person are generated
automatically using an action impersonation generator/editor means
comprising the following steps:

- in the first step, video recordings are made of said person
  carrying out a number of defined actions;
- in the second step, the action impersonation generator/editor
  automatically analyses the video recordings to generate a set of
  personal action impersonation parameters.



108. A method in accordance with claim 95 wherein said software
director uses voice impersonation parameters defined for an avatar to
generate speech from text using text to speech engine means for said
avatar such that said avatar speaks recognisably like the person it
represents comprising the following steps:

- intelligent agent software unit means generates said text;
- text to speech engine means converts said text to speech;
- said speech is played on said computing appliance.

109. A method in accordance with claim 108 wherein said voice
impersonation parameters are defined for said avatar of a particular
person comprising the following steps:

- recording said person speaking predefined text;
- processing said recording using impersonation parameter generation
  software;
- said impersonation parameter generation software outputting said
  voice impersonation parameters for that person;
- storing said voice impersonation parameters in said avatar.

110. A method in accordance with claim 108 wherein the person who is
being impersonated is known to the user such that the avatar
impersonating said person speaks and moves recognisably like said
person.

111. A method in accordance with any of claims 80 to 109 wherein said
avatar means comprises a 3D avatar.

112. A method in accordance with any of claims 80 to 109 wherein said
avatar means comprises a parameter avatar.

113. A method in accordance with any of claims 80 to 109 wherein said
avatar means comprises an animatable image avatar.

114. A method in accordance with claim 80 or 81 wherein said
avatar means is first generated using an avatar generator editor.

115. A method in accordance with claim 114 wherein said avatar
generator editor means comprises a booth.

116. A method in accordance with claim 114 wherein said avatar
generator editor means comprises a camera.

117. A method in accordance with claim 114 wherein said avatar
generator editor means comprises a service.

118. A method in accordance with claim 81 wherein after a voice
communication by a user, a speech recognition engine means processes
the voice communication comprising the following steps:

- a user generates a voice communication by speaking;
- a speech recognition means processes the voice communication and
  outputs text;
- the text is sent to any intelligent agent software units involved
  in the session.

119. A method in accordance with claim 118 wherein a user speaks in a
first language and an intelligent agent software unit operates in a
second language such that text is translated by translation engine
means comprising the following steps:

- a user generates a voice communication by speaking in a first
  language;
- a speech recognition means that operates in said first language
  processes the voice communication in said first language and
  outputs text in said first language;
- the text in said first language is translated by translation engine
  means into text in a second language;
- text in said second language is sent to any intelligent agent
  software units involved in the session capable of processing text
  in said second language.

120. A method in accordance with any of claims 95 to 118 wherein a
user understands a first language and an intelligent agent software
unit operates in a second language such that text is translated by
translation engine means comprising the following steps:

- an intelligent agent software unit generates text in a first
  language;
- the text in said first language is translated by translation engine
  means into text in a second language;
- text to speech engine means converts said text in said second
  language to speech in said second language;
- said speech in said second language is played to said user using
  loudspeaker means.
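
The speech recognition, translation and text to speech steps of the
three preceding claims may be illustrated as a chain of pluggable
stages. In the Python sketch below the engines are passed in as plain
callables because this disclosure does not name particular engines;
the function names, the language codes and the toy stand-in engines in
the usage lines are all hypothetical.

```python
def user_to_agent(audio, recognise, translate, user_language, agent_language):
    """Speech in the user's language -> text in the agent's language."""
    text = recognise(audio, user_language)
    if user_language != agent_language:
        text = translate(text, user_language, agent_language)
    return text


def agent_to_user(agent_text, translate, synthesise,
                  agent_language, user_language):
    """Text generated by the agent -> speech in the user's language."""
    text = agent_text
    if agent_language != user_language:
        text = translate(text, agent_language, user_language)
    return synthesise(text, user_language)


# Toy stand-ins for the engines, for illustration only.
TOY_DICTIONARY = {
    ("fr", "en", "bonjour"): "hello",
    ("en", "fr", "hello"): "bonjour",
}


def toy_recognise(audio, language):
    return audio["transcript"]


def toy_translate(text, source, target):
    return TOY_DICTIONARY[(source, target, text)]


def toy_synthesise(text, language):
    return f"<speech lang={language}>{text}</speech>"


heard = user_to_agent({"transcript": "bonjour"},
                      toy_recognise, toy_translate, "fr", "en")
spoken = agent_to_user("hello", toy_translate, toy_synthesise, "en", "fr")
```

When the user and the intelligent agent software unit share a
language, both functions skip the translation stage, which corresponds
to the untranslated flow of the first of the three claims.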

121. A method in accordance with any of claims 80 to 120 wherein said 
computing appliance means uses a display device comprising two 
projector means comprising the following steps: 

a first projector projects an avatar virtual environment; 
a second projector projects a presentation; 
such that both projections respond to changes independently and at the 
frame rate being used. 

122. A method in accordance with claim 95 wherein directional 
microphone means and seating plan means are used comprising the 
following steps: 

- a person speaks;

- said directional microphone means records said person's speech and
the direction said speech is coming from;

- said software director uses said seating plan and said direction
that said speech is coming from to generate avatar enactments such
that displayed avatars can gaze in the direction of the speaker.
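
One way the software director might combine the seating plan with the measured direction is sketched below. This is an assumption-laden illustration: the seating plan is modelled as a hypothetical mapping from seat name to bearing (in degrees, as seen from the microphone), and the function names are invented for this example rather than taken from the described system.

```python
import math
from typing import Dict, Tuple


def angular_gap(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two bearings, in degrees."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)


def locate_speaker(seat_bearings: Dict[str, float], heard_bearing_deg: float) -> str:
    """Pick the seat whose bearing in the seating plan is closest to the sound."""
    return min(
        seat_bearings,
        key=lambda seat: angular_gap(seat_bearings[seat], heard_bearing_deg),
    )


def gaze_yaw_deg(avatar_xy: Tuple[float, float], target_xy: Tuple[float, float]) -> float:
    """Yaw in degrees (counter-clockwise from the +x axis) turning an avatar towards a target."""
    dx = target_xy[0] - avatar_xy[0]
    dy = target_xy[1] - avatar_xy[1]
    return math.degrees(math.atan2(dy, dx))
```

The wrap-around handling in `angular_gap` matters because a sound heard at 5 degrees is closer to a seat at 350 degrees than to one at 90 degrees.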

123. A method in accordance with any of claims 80 or 81 wherein users 
communicate whilst exercising on exercise station means comprising the 
following steps: 

- a first user using a first exercise station means;
- a second user using a second exercise station means;

- said first user viewing the avatar of said second user using a
virtual exercise station;

- said second user viewing the avatar of said first user using a
virtual exercise station;

- said first and second users communicating by voice;

- optionally said first and second users viewing performance data
generated by said first and second exercise station means;

- optionally any user being able to see if the other user has stopped
exercising.
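
The optional steps above, sharing performance data and showing whether the other user has stopped exercising, could be sketched as follows. The `StationSample` fields and the five second idle threshold are assumptions made for illustration; the claim does not specify which data an exercise station reports or how stopping is detected.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class StationSample:
    """Hypothetical performance reading sent by an exercise station."""
    user: str
    timestamp_s: float
    cadence_rpm: float
    distance_m: float


def has_stopped(
    samples: Iterable[StationSample],
    now_s: float,
    idle_after_s: float = 5.0,  # assumed threshold, not from the source
) -> bool:
    """True if no sample within the idle window shows non-zero cadence."""
    return not any(
        s.cadence_rpm > 0 and now_s - s.timestamp_s <= idle_after_s
        for s in samples
    )
```

The result could drive the remote user's avatar, for example by showing it at rest on the virtual exercise station.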

124. A method in accordance with any of claims 80 or 81 wherein users 
are present in Cave means with motion capture systems means comprising
the following steps:

- said motion capture system means records movements of a first user
in a first Cave means;

- said recorded movements are sent with acceptable lag from said
first Cave means to a second Cave means;

- an avatar of said first user is displayed in said second Cave means
such that the movements of said avatar duplicate the movements of
said user in space;

- a second user wearing shutter glasses or similar immersive 3D
viewing means in said second Cave means views the movements of the
avatar of said first user as if said first user were physically in
said second Cave with said second user.
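
The idea of sending recorded movements "with acceptable lag" can be illustrated by a receiver that applies only sufficiently fresh motion capture frames to the remote avatar. This is a sketch under stated assumptions: the frame layout, the class names and the 0.2 second lag budget are hypothetical, since the claim does not quantify what lag is acceptable.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Joint = Tuple[float, float, float]


@dataclass
class MocapFrame:
    """Hypothetical motion capture frame: capture time plus joint positions."""
    captured_at_s: float
    joints: Dict[str, Joint]


class RemoteAvatar:
    """Avatar shown in the second Cave, driven by frames from the first."""

    def __init__(self, max_lag_s: float = 0.2) -> None:  # assumed lag budget
        self.max_lag_s = max_lag_s
        self.pose: Dict[str, Joint] = {}
        self.last_applied_s: Optional[float] = None

    def apply(self, frame: MocapFrame, now_s: float) -> bool:
        """Apply a frame if it is fresh and newer than the last one applied."""
        if now_s - frame.captured_at_s > self.max_lag_s:
            return False  # too stale to show without visible delay
        if self.last_applied_s is not None and frame.captured_at_s <= self.last_applied_s:
            return False  # out-of-order or duplicate frame
        self.pose = dict(frame.joints)
        self.last_applied_s = frame.captured_at_s
        return True
```

Dropping stale frames rather than queueing them keeps the displayed avatar close to the first user's current movements, at the cost of occasional skipped poses.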

125. A method in accordance with any of claims 80 or 81 wherein users 
communicate in virtual exhibition means comprising the following
steps:

- a user navigates in a virtual exhibition stand of a company;

- said user views and interacts with virtual objects representing
products;

- optionally said user communicates remotely with a real sales
representative;

- optionally said user communicates with an intelligent agent avatar;
- optionally said user views presentations;
- optionally said user buys said product.

126. A method in accordance with any of claims 80 or 81 wherein users
communicate in an interactive game hosted on a game hosting server 
comprising the following steps: 

- a first user interacts with the game, navigates around the 3D game
scene and views the avatar of a second user;

- said second user interacts with said game, navigates around said 3D
game scene and views the avatar of said first user;

- said first user communicates by speaking;

- said second user hears said first user and views the avatar of said
first user in lip synchronisation with said first user's speech;

- said second user communicates by speaking;

- said first user hears said second user and views the avatar of said
second user in lip synchronisation with said second user's speech.

127. A method in accordance with any of claims 80 or 81 wherein a
remote presenting user presents a presentation remotely comprising the
following steps:

- said remote presenting user starts a prepared presentation;

- remote audience users watch the avatar of said remote presenting
user perform said prepared presentation;

- present audience users present physically together in a theatre
watch a projection of the avatar of said remote presenting user
perform said prepared presentation;

- said prepared presentation ends;

- a remote audience user asks a question;

- said remote presenting user views the avatar of said remote
audience user asking the question from amongst a single virtual
audience and said avatar of said remote audience user gazes at said
remote presenting user;

- said present audience users view the avatar of said remote audience
user asking the question from amongst a single virtual audience
around the avatar of said remote presenting user and said avatar of
said remote audience user gazes at said avatar of said remote
presenting user.

128. A method in accordance with claim 127 wherein, in a prior step, said
remote presenting user prepares a presentation using presentation
prepare means.